Elias Torres wrote:
Allen Gilliland wrote:
Elias Torres wrote:
Allen Gilliland wrote:
Elias, there are a number of things in this commit that seem a
little strange to me ...
1. why do we need 3 separate tables for the tag aggregate data?
Actually we only need 2. The usertagagg is not going to be used.
The reason is that if we have n unique tags and m sites, then
queries for a site tag cloud would be against a table n x m rows
long vs n rows long. Weblog tagclouds can be answered quickly from
the weblogtagagg table by simply restricting the rows by websiteid
and taking every row. But the site tagcloud from the same table
would require reading the entire table. This might not be a big
deal, if you guys think is overkill, one could do.
I don't care if you keep a seperate record for the site-wide and
individual weblog aggregate data, but I think we can keep it all in
one table right? Basically, the site-wide version for a tag is a
weblogtagagg row with websiteid = NULL.
So the query for hot tags site-wide would be something like ...
select * from weblogtaggagg where websiteid = NULL order by cnt;
You are definitely right.
2. why do we need a SummariesTask to run asynchronously? i
thought the aggregate data was going to be compiled at the time
the tags were saved with the entry?
This is a very good question and I'll gladly change it if we have a
better option. However, I'm not sure if we can do it correctly.
Let me explain. If we have a new entry using an existent tag and I
make a query to get its current count and then increment it with
one won't I have problems with the way hibernate works? I'd think
that I would be clobbering values as multiple requests are coming
for the same tag since we are not "locking" the table and calling
flush() at different stages in the code. I used a task approach to
make sure I wouldn't lose any values and that we could easily
recompute the aggregate table through an admin setting if needed.
I think the likelihood of that is very slim. Since the data is
getting updated at entry publish time then the only chance for that
to occur is if 2 or more people are posting their entry at the exact
same time and using the exact same tags. Even then I would expect
Hibernate to account for that.
We do a far more impacting version of that same thing with the way
the current ReferrerManager works for counting referrers throughout
the day. That thing works the same way ... select RefererData, update
cnt, save. And it does that on every single request to a weblog, so
it's far more likely that we would get a collision problem there on a
high volume site then with the tag agg cnt.
k. sounds good then.
The other problem I see with doing it as a task is that then the data
isn't kept as fresh and you are spending a lot of extra time
iterating over every single tag and recomputing its count. I think I
would prefer if we try and stick with the original plan first and
update the aggregate data when the entry is actually saved.
It runs hourly (same as our cache anyways, right?) and it doesn't spend
a lot of extra time. It stores the last time it ran and selects only on
the new tags since and does all of the counting in the query. Anyways,
I'll switch it over to the referer way.
Well, that's not really how the cache works. The default cache is an
expiring cache, so content older than an hour is automatically
invalidated, but that doesn't mean it can't be invalidated before then.
I'm not even sure why we use that expiring cache by default anymore,
it's a relic of the old 1.x days and since we use our own custom cache
settings on BSC I've never bothered to mess with it. We should probably
be using the non-expiring cache which would simply mean that cached
content is valid until it's explicitly expired.
3. why do we need to move things into a separate TagManager
class?
I was chatting with Dave yesterday and initially wrote the
sumarizing code directly to hibernate but I then noticed that I
needed to abstract that. I asked which one he preferred: new
interface or add them to the same one. He leaned towards a new one,
but I can shove them into WeblogManager.
I don't follow that at all. Maybe you can start by explaining what
that method is for and why we need it, it's not documented in the
code.
getTags() was in the proposal and summarize() is the one used by the
task to summarize the tag table into the aggregate table. It doesn't
matter anymore since, I'll be removing the TagManager and use the
Referer approach.
Also, I am noticing that the formatting of your code is not
always consistent. For example, in the WeblogTagAggregateData
class you are mixing code formatting styles. Maybe that's just a
copy/paste issue, but I think it's important that as we add new
code we keep the formatting consistent.
I have set my IDE to convert tabs to spaces and using 2 spaces for
each tab. Maybe I need to change my settings. Sorry.
It looks like it was code that you copied and pasted and probably
didn't reformat. It's not that big of a deal, but it's definitely
nice if everyone can keep their code formatted nicely when adding it
to the repository so it doesn't have to be cleaned up later.
Sure. What's the coding convention so I can fix my IDE settings? tabs?
spaces? how many?
Just the straight up Java coding conventions ...
http://java.sun.com/docs/codeconv/
Once again, we should probably have this in a developers guide so that
it's not passed around as info after the fact. That's our fault for not
having that setup already.
I'll put that on my TODO list though and get a page put up on the wiki.
-- Allen
-Elias