Re: About Tags, a proposed data-model

Allen Gilliland Mon, 30 Jan 2006 10:49:38 -0800

I agree that is true *if* results are cached, but therein lies the problem.


I have spent quite a bit of time working on caching and performance in Roller 
and with our current setup it's not caching that is hard, it's having a big 
enough cache that's hard.  The fact is that a blog takes up a lot of space 
because you have to consider caching entries, comments, bookmarks & folders, 
categories, referers, and templates.  As a blog grows so do most of those 
things, especially the entries & comments.  On top of just caching those 
objects we currently cache fully rendered pages and feeds, so that means a 
handful of xml feeds and quite a few full html pages.  The point being, on a 
large site there is tons of data that needs caching already without having to 
cache tag related data.

Currently, I have no idea how we can expect to cache all the data that would be 
needed for a full tagging system along with everything else we cache right now.

-- Allen


On Mon, 2006-01-30 at 07:07, John Hoffmann wrote:
> I'd just like to add that performing joins in sql is not something to be 
> avoided,  the impact can be almost completely mitigated by caching the 
> results.   The only real cause for concern is for truly massive datasets 
> in which the join cannot be performed in the amount of memory available 
> to the database.
> 
> -John
> 
> Allen Gilliland wrote:
> 
> >I don't want to lose this thread because I think there are still some ideas 
> >to continue flushing out.  More comments inline ...
> >
> >On Fri, 2006-01-13 at 05:14, David Levy wrote:
> >  
> >
> >>Sorry to have taken so long.
> >>
> >>The denormalisations of the author_name (which may be owner name) and 
> >>entrydate are to support queries.  This is because I expect a macro to 
> >>create a tag cloud for a user so that the html versions can have the tag 
> >>cloud,
> >>
> >>So
> >>
> >>select normal_tag_name, count(*)
> >>from entry2tags
> >>where author_name = "DaveLevy"
> >>group by normal_tag_name
> >>
> >>gives us the data required for a tag cloud, for a single blog . No join 
> >>as you can see, where it gets fun is if you want a hot tags cloud
> >>
> >>we add a line so the query becomes
> >>
> >>select normal_tag_name, count(*)
> >>from entry2tags
> >>where author_name = "DaveLevy"
> >>and     entry_date > @sevedaysago
> >>group by normal_tag_name
> >>    
> >>
> >
> >very cool stuff ... i like the looks of that.
> >
> >... lots of stuff chopped out here ...
> >
> >  
> >
> >>>>is there a reason to copy down the entry_date rather than access it via a 
> >>>>join on the weblogentry table?  you can join with the weblogentry table 
> >>>>using the entry_id column.  how were you planning to use the entry_date 
> >>>>field?
> >>>>   
> >>>>
> >>>>        
> >>>>
> >>see above,  I don't want to join to the entry table, and this goal also 
> >>impacts my index design.  entry_date allows hot tags queries to be 
> >>driven by the entry not the tagged date
> >>    
> >>
> >
> >ok.  I agree that joins are a likely performance problem, but my next 
> >question then becomes ... How do we plan to deal with getting the data for 
> >the list of entries marked with a given tag?  I am expecting that when 
> >someone uses the tag dashboard or a tag cloud to try and view a list of 
> >entries with the tag "foo", that list will look something like the Roller 
> >front page.  example ...
> >
> >url = /roller/tag/entries/java+netbeans
> >
> >you then populate a page with 50? 25? entry summaries for people to browse 
> >through and those summaries will at least require the entry title and a 
> >summary of the entry content and may also require the entry date, category, 
> >author, and weblog title.  I would think we are going to require a join to 
> >get that data.
> >
> >  
> >
> >>>>what acts as the primary key?  (author_name, entry_id, user_tag_name)?  
> >>>>
> >>>>        
> >>>>
> >>my PK is author_name, entry_id and normal_tag_name
> >>
> >>    
> >>
> >>>>we may still need to use a surrogate key to uniquely identify the row to 
> >>>>avoid having a multi column primary key. 
> >>>>
> >>>>        
> >>>>
> >>yeah, looks like it
> >>
> >>    
> >>
> >>>>i'm not sure that would be much of an issue though because it doesn't 
> >>>>look like you are planning for any joins for the tag names, correct?
> >>>>
> >>>>   
> >>>>
> >>>>        
> >>>>
> >>I'm trying to avoid any joins, but if you are looking for entries on a 
> >>blog and tagged, then it would be good to enter the query on 
> >>author_name, but since we have not copied the title down to the 
> >>entry2tags table we still need the join and can go in on author_name on 
> >>either table, but best do it on entry table (see below).
> >>
> >>select  entry.title
> >>from    entry, entry2tags e2t
> >>where e2t.entry_id = entry.id
> >>and     entry.author_name = @KnownName
> >>and     e2t.normal_tag_name in (@TagQueryList)
> >>    
> >>
> >
> >based on my example above, how would we get the necessary entry data when we 
> >don't know the author name because we are searching through the entire tag 
> >system, not just from a single author or weblog?
> >
> >-- Allen
> >
> >
> >  
> >
> >>>Right, Allen. This is very similar to what we already have. I'm not sure
> >>>why having author name here, when we already have that through a join
> >>>with the entry_id.
> >>>
> >>>      
> >>>
> >>That's right, but I don't want to join, the entry2tags table is big 
> >>enough without joining.
> >>
> >>    
> >>
> >>>I'm not sure why do we need entry_date when we have
> >>>the tagging date. 
> >>>
> >>>      
> >>>
> >>I think the queries should be driven through the entry publication date.
> >>
> >>    
> >>
> >>>I do like the normal tag name. I was thinking this too for the output of 
> >>>the Porter Stemming algorithm so I wouldn't lose the original information 
> >>>entered by the user. Everything else is the same :-).
> >>> 
> >>>
> >>>      
> >>>
> >>I need to read your references to understand this, but I think you agree 
> >>this is OK
> >>
> >>    
> >>
> >>>Another point I want to make is the fact, that we do a little bit of
> >>>extra processing when saving an existing entry that make sure it keeps
> >>>the original date for each tag. For example, when I first created an
> >>>entry I tagged it A,B,C. The first time I edited, I removed B. The A and
> >>>C tags will retain the dates when they were added as opposed to the edit
> >>>date.
> >>> 
> >>>
> >>>      
> >>>
> >>Are you holding entry_tag data on the entry table?
> >>
> >>    
> >>
> >>>Additionally, we have a question of what to do with spaces. Should tags
> >>>be multi-word or not? My suggestion to Phay (one of our developers) was
> >>>to use spaces as separators in the input field, therefore not supporting
> >>>spaces. But we could do multiple things to support spaces, such as
> >>>quoting multi-word tags. I believe Flickr supports multi-words but they
> >>>remove the spaces from the tags, but technorati does maintain spaces. I
> >>>don't like them myself, because I think it fragments the tag space much
> >>>more than single words and you could still use intersections to get the
> >>>sort of the same result.
> >>>
> >>>
> >>> 
> >>>
> >>>      
> >>>
> >>>>>create constraint defined_tags_name as
> >>>>>   normal_tag_name=proper(user_tag_name)
> >>>>>
> >>>>>create constraint defined_tags_entry_date as
> >>>>>   entry_date = select entry_date from entry
> >>>>>                          where definedtags.entry_name = entry.entry_name
> >>>>>
> >>>>>and the following indexes
> >>>>>
> >>>>>create dt.tags on definedtags
> >>>>>   as author_name, entry_name, normal_tag_name unique
> >>>>>     
> >>>>>
> >>>>>          
> >>>>>
> >>This is the real primary key
> >>
> >>    
> >>
> >>>>>create dt.tags2 on definedtags
> >>>>>   as normal_tag_name
> >>>>>     
> >>>>>
> >>>>>          
> >>>>>
> >>This is the tag entity (or operational master)
> >>
> >>    
> >>
> >>>>>create dt.entries on definedtags
> >>>>>   as author_name, entry_name
> >>>>>
> >>>>>create dt.date_written
> >>>>>   on definedtags as entry_date.
> >>>>>
> >>>>>I have written this in a hurry so it may not be though out as well as
> >>>>>some of my writing, but hopefully this is collaborative development.
> >>>>>Also I have difficulty in commenting on and reading some of the
> >>>>>front-end & java orientated stuff. (I have ordered a couple of books to
> >>>>>help me catchup). Hopefully this is helpfull
> >>>>>
> >>>>>     
> >>>>>
> >>>>>          
> >>>>>
> >>>>I think this all makes sense to me so far and I certainly appreciate the 
> >>>>help.  I think getting the data model correct is a *very* important issue 
> >>>>before we move forward with implemenatation, so I'm glad we are having 
> >>>>this discussion.
> >>>>
> >>>>-- Allen
> >>>>   
> >>>>
> >>>>        
> >>>>
> >>>[1] http://www.tartarus.org/~martin/PorterStemmer/def.txt
> >>>[2] http://torrez.us/archives/2005/07/13/tagrank.pdf
> >>>
> >>> 
> >>>
> >>>      
> >>>
> >>>>>Elias Torres wrote:
> >>>>>
> >>>>>     
> >>>>>
> >>>>>          
> >>>>>
> >>>>>>Welcome David to the Roller list.
> >>>>>>
> >>>>>>Thank you for your post. I have read your blog post on a tag data
> >>>>>>model for Roller. I'm looking forward to your relational algebra and
> >>>>>>query cost analysis. I wanted to tell you that we (IBM) have already
> >>>>>>added basic tagging support to Roller and it actually supports a
> >>>>>>TagCloud. I am supposed to put a proposal in the roller wiki so others
> >>>>>>could comment and once I do that, you could put your comments there as
> >>>>>>well.
> >>>>>>
> >>>>>>Just to kickstart the conversation I'm including the tagging table we
> >>>>>>are currently using.
> >>>>>>
> >>>>>>create table weblogentrytag (
> >>>>>>  id              varchar(48)   not null primary key,
> >>>>>>  entryid        varchar(48)   not null,
> >>>>>>  name            varchar(255)  not null,
> >>>>>>  tagtime         timestamp     not null
> >>>>>>);
> >>>>>>
> >>>>>>We have basically two tables: entries and entry2tags, but are missing
> >>>>>>a tag table. At first, I was very set on having a tag table and use a
> >>>>>>foreign key to "save" space on repetitive tag names. But I was shown
> >>>>>>it's not really a big space saving technique, especially since tag
> >>>>>>names are relatively short storing a guid or int would almost be
> >>>>>>comparable in space. There are also increased costs in inserting and
> >>>>>>joining on tables to get tag names if using a foreign key, so we have
> >>>>>>settle on this for now until we have other queries requirements. I'll
> >>>>>>be summarizing all of our changes to roller to support tagging in a
> >>>>>>wiki proposal soon.
> >>>>>>
> >>>>>>Regarding the use of the list, some people have been using nabble.com
> >>>>>>to interact with it. Maybe you can give it a try. I simply use gmail.
> >>>>>>
> >>>>>>http://www.nabble.com/Roller-f12275.html
> >>>>>>
> >>>>>>Regards,
> >>>>>>
> >>>>>>Elias
> >>>>>>
> >>>>>>On 1/4/06, David Levy <[EMAIL PROTECTED]> wrote:
> >>>>>>
> >>>>>>
> >>>>>>       
> >>>>>>
> >>>>>>            
> >>>>>>
> >>>>>>>I have documented a data model for tags. This is held at my blog
> >>>>>>>
> >>>>>>>http://blogs.sun.com/roller/page/DaveLevy?entry=implementing_tags_in_a_database
> >>>>>>>
> >>>>>>>I have a graphic demonstrating the relationship between authors,
> >>>>>>>articles and tags and illustrating the first and obvious indexes. (I
> >>>>>>>have identified that both "Date Published" and tag aggregates are
> >>>>>>>missing from the model).  Since the model was built to help me
> >>>>>>>understand del.icio.us, I call the entities Users, Bookmarks and Tags,
> >>>>>>>but hopefully its simple to see that these are pretty synonomous to
> >>>>>>>authors, articles and tags.
> >>>>>>>
> >>>>>>>I hope that this is useful for those looking at implementing tags.
> >>>>>>>
> >>>>>>>I am still working out how to use the mail-list, so I hope that 
> >>>>>>>x-refing you to my blog isn't deprecated. I also need to work how to 
> >>>>>>>maintain thread connections i.e. undertake a reply.
> >>>>>>>--
> >>>>>>>
> >>>>>>>Dave
> >>>>>>>         
> >>>>>>>
> >>>>>>>              
> >>>>>>>
> >>>>>--
> >>>>>
> >>>>>Dave
> >>>>>
> >>>>>     
> >>>>>
> >>>>>          
> >>>>>
> >>>>   
> >>>>
> >>>>        
> >>>>
> >>-- 
> >>
> >>Dave
> >>
> >><http://www.sun.com>        * David Levy *
> >>
> >>
> >>
> >>
> >>
> >>
> >>Blog http://blogs.sun.com/DaveLevy
> >>Email [EMAIL PROTECTED]
> >>
> >>Sun Proprietary & Confidential . This e-mail message is for the sole use 
> >>of the intended recipient(s) and may contain confidential and 
> >>privilidged information. Any unauthorised review, use, disclosure or 
> >>distribution is prohibited. If you are not the intended recepient, 
> >>please contact the sender by reply e-mail and destroy all copies of the 
> >>original message.
> >>
> >>    
> >>
> >
> >  
> >
>

Re: About Tags, a proposed data-model

Reply via email to