Stefan Groschupf wrote:
Hi,
some thoughts about meta data.
We agree that we try to minimize the usage of meta data, to keep
performance high.
Since we decided to have meta data separated, I was thinking of a meta
data db like the crawl db we have today.
I am asking myself where we will need meta data, and whether it makes
sense to keep it separate or not.
My personal list:
[...]
As you point out, in many cases the additional metadata is needed
throughout most of the workflow. So, it would make more sense to keep it
together with CrawlDatum.
+ generation // having meta data here to decide if a page should be
fetched or not
+ fetching // here I'm not sure; maybe we need meta data for fetching, but
it would be great to store session or authentication information
that can be used when fetching.
Yes, that's a perfect example. Also, last modification time is required
to detect modified content.
However, during fetching and parsing, meta data for a URL can be created.
+ updating // during updating I was planning to overwrite the old meta
data with the new data. I had the idea to use System.currentTimeMillis()
as a stored timestamp to identify the newer meta data, but I have no
idea if the current millis are fast enough for the job, any thoughts?
Do we need versioning or timestamping of metadata? I can't imagine
why... we already store the last fetch time.
+ indexing // to add url meta data into the index.
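To make the updating question above concrete, here is a minimal sketch of
overwriting old meta data with new, using the last fetch time that is
already stored instead of a separate timestamp. The class DatumSketch and
its fields are hypothetical stand-ins for illustration only, not the actual
CrawlDatum code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a crawl db entry; not the real CrawlDatum.
final class DatumSketch {
    final long fetchTime;               // last fetch time, in milliseconds
    final Map<String, String> metaData; // per-URL meta data

    DatumSketch(long fetchTime, Map<String, String> metaData) {
        this.fetchTime = fetchTime;
        this.metaData = new HashMap<>(metaData); // defensive copy
    }

    // Keep the entry with the later fetch time; its meta data replaces
    // the old meta data wholesale. On a tie the fetched entry wins.
    static DatumSketch update(DatumSketch old, DatumSketch fetched) {
        return fetched.fetchTime >= old.fetchTime ? fetched : old;
    }
}
```

With this approach no extra version or timestamp field is needed, as long
as the fetch time is enough to tell which entry is newer.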
Well, looking at this list, I more and more believe that it would be
a better idea to store the meta data in the CrawlDatum object
directly. It saves a lot of code changes, and we need meta data
everywhere anyway.
[...]
So why not add meta data directly to CrawlDatum?
I thought it was already decided ;-) . Yes, we need to do just that.
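As a rough illustration of what keeping the meta data together with
CrawlDatum could look like, here is a small sketch. The class name
CrawlDatumSketch, the "skip" key, and all method names are assumptions made
up for this example; they are not the existing API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; names are illustrative, not the real CrawlDatum.
final class CrawlDatumSketch {
    private final String url;
    private final Map<String, String> metaData = new HashMap<>();

    CrawlDatumSketch(String url) {
        this.url = url;
    }

    void putMeta(String key, String value) {
        metaData.put(key, value);
    }

    String getMeta(String key) {
        return metaData.get(key);
    }

    // generation: decide from the meta data whether the page should be
    // fetched. The "skip" key is an assumed convention for this sketch.
    boolean shouldFetch() {
        return !"true".equals(metaData.get("skip"));
    }

    // indexing: expose the URL and its meta data as index fields.
    Map<String, String> indexFields() {
        Map<String, String> fields = new HashMap<>(metaData);
        fields.put("url", url);
        return fields;
    }
}
```

Because the meta data travels with the entry itself, generation, fetching,
updating and indexing can all read it without a lookup in a second db.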
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com