I agree that is true *if* results are cached, but therein lies the problem.
I have spent quite a bit of time working on caching and performance in Roller and with our current setup it's not caching that is hard, it's having a big enough cache that's hard. The fact is that a blog takes up a lot of space because you have to consider caching entries, comments, bookmarks & folders, categories, referers, and templates. As a blog grows so do most of those things, especially the entries & comments. On top of just caching those objects we currently cache fully rendered pages and feeds, so that means a handful of xml feeds and quite a few full html pages. The point being, on a large site there is tons of data that needs caching already without having to cache tag related data. Currently, I have no idea how we can expect to cache all the data that would be needed for a full tagging system along with everything else we cache right now. -- Allen On Mon, 2006-01-30 at 07:07, John Hoffmann wrote: > I'd just like to add that performing joins in sql is not something to be > avoided, the impact can be almost completely mitigated by caching the > results. The only real cause for concern is for truly massive datasets > in which the join cannot be performed in the amount of memory available > to the database. > > -John > > Allen Gilliland wrote: > > >I don't want to lose this thread because I think there are still some ideas > >to continue flushing out. More comments inline ... > > > >On Fri, 2006-01-13 at 05:14, David Levy wrote: > > > > > >>Sorry to have taken so long. > >> > >>The denormalisations of the author_name (which may be owner name) and > >>entrydate are to support queries. This is because I expect a macro to > >>create a tag cloud for a user so that the html versions can have the tag > >>cloud, > >> > >>So > >> > >>select normal_tag_name, count(*) > >>from entry2tags > >>where author_name = "DaveLevy" > >>group by normal_tag_name > >> > >>gives us the data required for a tag cloud, for a single blog . No join > >>as you can see, where it gets fun is if you want a hot tags cloud > >> > >>we add a line so the query becomes > >> > >>select normal_tag_name, count(*) > >>from entry2tags > >>where author_name = "DaveLevy" > >>and entry_date > @sevedaysago > >>group by normal_tag_name > >> > >> > > > >very cool stuff ... i like the looks of that. > > > >... lots of stuff chopped out here ... > > > > > > > >>>>is there a reason to copy down the entry_date rather than access it via a > >>>>join on the weblogentry table? you can join with the weblogentry table > >>>>using the entry_id column. how were you planning to use the entry_date > >>>>field? > >>>> > >>>> > >>>> > >>>> > >>see above, I don't want to join to the entry table, and this goal also > >>impacts my index design. entry_date allows hot tags queries to be > >>driven by the entry not the tagged date > >> > >> > > > >ok. I agree that joins are a likely performance problem, but my next > >question then becomes ... How do we plan to deal with getting the data for > >the list of entries marked with a given tag? I am expecting that when > >someone uses the tag dashboard or a tag cloud to try and view a list of > >entries with the tag "foo", that list will look something like the Roller > >front page. example ... > > > >url = /roller/tag/entries/java+netbeans > > > >you then populate a page with 50? 25? entry summaries for people to browse > >through and those summaries will at least require the entry title and a > >summary of the entry content and may also require the entry date, category, > >author, and weblog title. I would think we are going to require a join to > >get that data. > > > > > > > >>>>what acts as the primary key? (author_name, entry_id, user_tag_name)? > >>>> > >>>> > >>>> > >>my PK is author_name, entry_id and normal_tag_name > >> > >> > >> > >>>>we may still need to use a surrogate key to uniquely identify the row to > >>>>avoid having a multi column primary key. > >>>> > >>>> > >>>> > >>yeah, looks like it > >> > >> > >> > >>>>i'm not sure that would be much of an issue though because it doesn't > >>>>look like you are planning for any joins for the tag names, correct? > >>>> > >>>> > >>>> > >>>> > >>>> > >>I'm trying to avoid any joins, but if you are looking for entries on a > >>blog and tagged, then it would be good to enter the query on > >>author_name, but since we have not copied the title down to the > >>entry2tags table we still need the join and can go in on author_name on > >>either table, but best do it on entry table (see below). > >> > >>select entry.title > >>from entry, entry2tags e2t > >>where e2t.entry_id = entry.id > >>and entry.author_name = @KnownName > >>and e2t.normal_tag_name in (@TagQueryList) > >> > >> > > > >based on my example above, how would we get the necessary entry data when we > >don't know the author name because we are searching through the entire tag > >system, not just from a single author or weblog? > > > >-- Allen > > > > > > > > > >>>Right, Allen. This is very similar to what we already have. I'm not sure > >>>why having author name here, when we already have that through a join > >>>with the entry_id. > >>> > >>> > >>> > >>That's right, but I don't want to join, the entry2tags table is big > >>enough without joining. > >> > >> > >> > >>>I'm not sure why do we need entry_date when we have > >>>the tagging date. > >>> > >>> > >>> > >>I think the queries should be driven through the entry publication date. > >> > >> > >> > >>>I do like the normal tag name. I was thinking this too for the output of > >>>the Porter Stemming algorithm so I wouldn't lose the original information > >>>entered by the user. Everything else is the same :-). > >>> > >>> > >>> > >>> > >>I need to read your references to understand this, but I think you agree > >>this is OK > >> > >> > >> > >>>Another point I want to make is the fact, that we do a little bit of > >>>extra processing when saving an existing entry that make sure it keeps > >>>the original date for each tag. For example, when I first created an > >>>entry I tagged it A,B,C. The first time I edited, I removed B. The A and > >>>C tags will retain the dates when they were added as opposed to the edit > >>>date. > >>> > >>> > >>> > >>> > >>Are you holding entry_tag data on the entry table? > >> > >> > >> > >>>Additionally, we have a question of what to do with spaces. Should tags > >>>be multi-word or not? My suggestion to Phay (one of our developers) was > >>>to use spaces as separators in the input field, therefore not supporting > >>>spaces. But we could do multiple things to support spaces, such as > >>>quoting multi-word tags. I believe Flickr supports multi-words but they > >>>remove the spaces from the tags, but technorati does maintain spaces. I > >>>don't like them myself, because I think it fragments the tag space much > >>>more than single words and you could still use intersections to get the > >>>sort of the same result. > >>> > >>> > >>> > >>> > >>> > >>> > >>>>>create constraint defined_tags_name as > >>>>> normal_tag_name=proper(user_tag_name) > >>>>> > >>>>>create constraint defined_tags_entry_date as > >>>>> entry_date = select entry_date from entry > >>>>> where definedtags.entry_name = entry.entry_name > >>>>> > >>>>>and the following indexes > >>>>> > >>>>>create dt.tags on definedtags > >>>>> as author_name, entry_name, normal_tag_name unique > >>>>> > >>>>> > >>>>> > >>>>> > >>This is the real primary key > >> > >> > >> > >>>>>create dt.tags2 on definedtags > >>>>> as normal_tag_name > >>>>> > >>>>> > >>>>> > >>>>> > >>This is the tag entity (or operational master) > >> > >> > >> > >>>>>create dt.entries on definedtags > >>>>> as author_name, entry_name > >>>>> > >>>>>create dt.date_written > >>>>> on definedtags as entry_date. > >>>>> > >>>>>I have written this in a hurry so it may not be though out as well as > >>>>>some of my writing, but hopefully this is collaborative development. > >>>>>Also I have difficulty in commenting on and reading some of the > >>>>>front-end & java orientated stuff. (I have ordered a couple of books to > >>>>>help me catchup). Hopefully this is helpfull > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>>>I think this all makes sense to me so far and I certainly appreciate the > >>>>help. I think getting the data model correct is a *very* important issue > >>>>before we move forward with implemenatation, so I'm glad we are having > >>>>this discussion. > >>>> > >>>>-- Allen > >>>> > >>>> > >>>> > >>>> > >>>[1] http://www.tartarus.org/~martin/PorterStemmer/def.txt > >>>[2] http://torrez.us/archives/2005/07/13/tagrank.pdf > >>> > >>> > >>> > >>> > >>> > >>>>>Elias Torres wrote: > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>>>>>Welcome David to the Roller list. > >>>>>> > >>>>>>Thank you for your post. I have read your blog post on a tag data > >>>>>>model for Roller. I'm looking forward to your relational algebra and > >>>>>>query cost analysis. I wanted to tell you that we (IBM) have already > >>>>>>added basic tagging support to Roller and it actually supports a > >>>>>>TagCloud. I am supposed to put a proposal in the roller wiki so others > >>>>>>could comment and once I do that, you could put your comments there as > >>>>>>well. > >>>>>> > >>>>>>Just to kickstart the conversation I'm including the tagging table we > >>>>>>are currently using. > >>>>>> > >>>>>>create table weblogentrytag ( > >>>>>> id varchar(48) not null primary key, > >>>>>> entryid varchar(48) not null, > >>>>>> name varchar(255) not null, > >>>>>> tagtime timestamp not null > >>>>>>); > >>>>>> > >>>>>>We have basically two tables: entries and entry2tags, but are missing > >>>>>>a tag table. At first, I was very set on having a tag table and use a > >>>>>>foreign key to "save" space on repetitive tag names. But I was shown > >>>>>>it's not really a big space saving technique, especially since tag > >>>>>>names are relatively short storing a guid or int would almost be > >>>>>>comparable in space. There are also increased costs in inserting and > >>>>>>joining on tables to get tag names if using a foreign key, so we have > >>>>>>settle on this for now until we have other queries requirements. I'll > >>>>>>be summarizing all of our changes to roller to support tagging in a > >>>>>>wiki proposal soon. > >>>>>> > >>>>>>Regarding the use of the list, some people have been using nabble.com > >>>>>>to interact with it. Maybe you can give it a try. I simply use gmail. > >>>>>> > >>>>>>http://www.nabble.com/Roller-f12275.html > >>>>>> > >>>>>>Regards, > >>>>>> > >>>>>>Elias > >>>>>> > >>>>>>On 1/4/06, David Levy <[EMAIL PROTECTED]> wrote: > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>>>I have documented a data model for tags. This is held at my blog > >>>>>>> > >>>>>>>http://blogs.sun.com/roller/page/DaveLevy?entry=implementing_tags_in_a_database > >>>>>>> > >>>>>>>I have a graphic demonstrating the relationship between authors, > >>>>>>>articles and tags and illustrating the first and obvious indexes. (I > >>>>>>>have identified that both "Date Published" and tag aggregates are > >>>>>>>missing from the model). Since the model was built to help me > >>>>>>>understand del.icio.us, I call the entities Users, Bookmarks and Tags, > >>>>>>>but hopefully its simple to see that these are pretty synonomous to > >>>>>>>authors, articles and tags. > >>>>>>> > >>>>>>>I hope that this is useful for those looking at implementing tags. > >>>>>>> > >>>>>>>I am still working out how to use the mail-list, so I hope that > >>>>>>>x-refing you to my blog isn't deprecated. I also need to work how to > >>>>>>>maintain thread connections i.e. undertake a reply. > >>>>>>>-- > >>>>>>> > >>>>>>>Dave > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>-- > >>>>> > >>>>>Dave > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>>> > >>>> > >>>> > >>>> > >>-- > >> > >>Dave > >> > >><http://www.sun.com> * David Levy * > >> > >> > >> > >> > >> > >> > >>Blog http://blogs.sun.com/DaveLevy > >>Email [EMAIL PROTECTED] > >> > >>Sun Proprietary & Confidential . This e-mail message is for the sole use > >>of the intended recipient(s) and may contain confidential and > >>privilidged information. Any unauthorised review, use, disclosure or > >>distribution is prohibited. If you are not the intended recepient, > >>please contact the sender by reply e-mail and destroy all copies of the > >>original message. > >> > >> > >> > > > > > > >
