Hi Doğacan,

OK, cool. Comments below:

>> My only question: why introduce a new data structure called “Markers” when
>> all that seems to be is a Metadata object. Let’s use
>> o.a.tika.metadata.Metadata to represent that? My only comment then would be,
>> aren’t we still doing something you mentioned you wanted to get rid of below,
>> where you said: “For example, during parsing we don't have access to a URL's
>> fetch status. So we copy fetch status into content metadata.” Aren’t we just
>> doing the same thing with Markers?
>> 
> 
> Actually, markers used to be stored in the metadata object in WebPage
> (metadata is a map from string to bytes). It just seemed clearer to me to put
> it into its own field. We can discuss if moving it back into metadata makes
> more sense.

OK, gotcha: creating a new data structure makes your intention clear that
this isn't "Metadata" in its vanilla sense, but a first-class object called
Markers whose purpose is to track where we are in a crawl cycle. Maybe we
should go a step further and call this class "CrawlCycleTracker". I think
that would make its purpose even more explicit.

> 
> One thing: We can't use tika's metadata object as WebPage object is generated
> from an avro schema.

Not to push this, because in the end I don't feel super strongly about it,
but can't Avro handle data types such as Tika's metadata object? Or does it
only do primitive types?

> 
> As for your last comment: Markers are only used to identify where we are in a
> crawl cycle and the individual crawl ids. So during parse, when we get a URL
> during MapReduce, parse can easily check if that URL has been fetched in
> *that* crawl cycle (since there is no point in parsing it if it hasn't been
> fetched). So it is not used to pass any important information around. It is
> just a simple tracking system. Did this make it any clearer?

Gotcha. In the end you could use the o.a.tika.metadata.Metadata object for
this anyway, but it wouldn't really be Metadata in its purest sense. Or
rather, it would be, but a very specific kind of Metadata: a set of markers
that identifies where we are in a crawl cycle.

So, +1 to having an explicit data structure to track this. I think we should
consider renaming it, though, as per my comment above, to make its purpose
more explicit.
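
To make the idea concrete, here is a minimal Java sketch of what such a
tracker could look like. The class name CrawlCycleTracker comes from the
suggestion above; the Stage enum, the method names, and the crawl id strings
are all hypothetical and are not taken from the actual WebPage schema or from
Nutch code. It only illustrates the check described in the quoted text:
parse looks at the markers to see whether a URL was fetched in *that* crawl
cycle.

```java
import java.util.EnumMap;
import java.util.Map;

/**
 * Hypothetical sketch of a per-page crawl cycle tracker.
 * This is NOT the real Nutch WebPage API; all names are assumptions.
 */
final class CrawlCycleTracker {

    /** Stages of one crawl cycle (names assumed for illustration). */
    enum Stage { GENERATE, FETCH, PARSE, UPDATE }

    // Maps each stage to the id of the crawl cycle that last completed it.
    private final Map<Stage, String> markers = new EnumMap<>(Stage.class);

    /** Records that the given stage completed in the given crawl cycle. */
    void mark(Stage stage, String crawlId) {
        markers.put(stage, crawlId);
    }

    /** True if the stage was completed in exactly this crawl cycle. */
    boolean isMarked(Stage stage, String crawlId) {
        return crawlId.equals(markers.get(stage));
    }

    /** Parse only pages fetched, and not yet parsed, in this crawl cycle. */
    boolean shouldParse(String crawlId) {
        return isMarked(Stage.FETCH, crawlId) && !isMarked(Stage.PARSE, crawlId);
    }
}
```

In this sketch the tracker carries no fetch status or other content
information. It only answers "has this stage run for this crawl id?", which
is what separates it from a general-purpose Metadata map.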

Thanks for the explanation!

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

