Re: [Nutch-dev] Thoughts on Parser design and dependencies

Sami Siren Fri, 18 Aug 2006 14:56:07 -0700

Andrzej Bialecki wrote:
> Sami Siren wrote:
>> Andrzej Bialecki wrote:
>>> Jukka Zitting wrote:
>>>
>>>> The Parser interface is also bound to the ideas of fetching content
>>>> from the network and indexing it using a standard content model
>>>> through the Content and Parse dependencies. For the Tika project I'd
>>>> like to look for ways to generalize this, as neither of these ideas
>>>> apply for example to the needs of the Apache Jackrabbit project. My
>>>> TextExtractor proposal avoids these dependencies by using just a
>>>> binary stream, a content type and an optional character encoding to
>>>> produce a single text stream, but that approach fails to support more
>>>> structured index content models. I'm trying to find a solution that
>>>> combines the best parts of both approaches.
>>>
>>> A very important aspect of the Parser interface (or actually, the 
>>> Parse and Content classes) is that they each may contain arbitrary 
>>> metadata. This is required for discovering and passing around both 
>>> the original metadata (such as protocol headers, document properties, 
>>> etc), and other secondary content (such as data from external 
>>> sources, or derived metadata).
>>>
>>> Simply returning a String doesn't cut it. Returning a java.util.Map 
>>> may be an option, if you use standard Metadata constants as keys - 
>>> still, Nutch would have to repackage this anyway into a Writable. And 
>>> we would lose a nice property of the current Metadata class, which is 
>>> the ability to tolerate minor syntax variations and to store multiple 
>>> values per key.
>>>
>> The tolerance for syntax variations should, instead of being written 
>> into the metadata object, live in a separate class, perhaps implemented 
>> as a decorator around the actual metadata. In fact, the places where 
>> Nutch needs this functionality (actually in the case of HTTP headers 
>> only??) are fewer than those where we know the exact names of the 
>> metadata keys (because we put them there).
>>
>> I'd +1 going for a Map as the interface to metadata and, at the 
>> same time, perhaps changing the Crawldb's metadata to the same metadata 
>> implementation or a subclass of it.
> 
> Hmm. Please keep in mind that we need to use a Writable, both for the 
> Map itself and also for every value that we put there. I'm worried that 
> this could lead to excessive re-packaging of all objects coming out of 
> Parsers, from their original formats (Map<String, String>) to MapWritable.


Yes, that is a potential problem, especially from an efficiency point of 
view. Someone should measure how much of a performance problem it actually is.

> Since the goal here is to get rid of dependencies on Nutch or Hadoop, 
> this means that Nutch will have to do such conversion because Tika would 
> not support Writable.
> 
>>
>> Perhaps we could even go for Map<String,String> or is there actually 
>> some use case for having multiple values for single key?
> 
> Original motivation for this was http headers and meta tags, which can 
> have multiple values. Another case is the language identification, where 
> the same key may have multiple values, coming from different sources. 
> Additionally, MapWritable supports any Writable, which is quite handy to 
> store non-string data and to avoid converting to/from strings.

I am not knocking MapWritable; in fact I think it's quite an efficient 
piece of code :) IMO the support for Writable in values is valuable, but 
for keys, hmmm... perhaps TextWritable is enough.
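
To make the decorator idea from earlier in the thread a bit more concrete, here is a rough sketch of what I have in mind: plain String keys, multiple values per key, and the tolerance for syntax variations layered on top in a separate class. The names SimpleMetadata and TolerantMetadata and the normalization rule are made up for illustration; this is not the existing Nutch Metadata class, and it leaves Writable out entirely.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical plain container: exact String keys, multiple values per key.
class SimpleMetadata {
    private final Map<String, List<String>> values = new HashMap<String, List<String>>();

    void add(String key, String value) {
        List<String> existing = values.get(key);
        if (existing == null) {
            existing = new ArrayList<String>();
            values.put(key, existing);
        }
        existing.add(value);
    }

    // Returns a read-only view of the values, or an empty list for an unknown key.
    List<String> get(String key) {
        List<String> found = values.get(key);
        if (found == null) {
            return Collections.<String>emptyList();
        }
        return Collections.unmodifiableList(found);
    }
}

// Hypothetical decorator: normalizes the key spelling, then delegates.
class TolerantMetadata {
    private final SimpleMetadata delegate;

    TolerantMetadata(SimpleMetadata delegate) {
        this.delegate = delegate;
    }

    // Assumed rule: lower-case the key and drop everything except a-z and 0-9,
    // so "Content-Type", "content_type" and "CONTENTTYPE" become the same key.
    static String normalize(String key) {
        return key.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
    }

    void add(String key, String value) {
        delegate.add(normalize(key), value);
    }

    List<String> get(String key) {
        return delegate.get(normalize(key));
    }
}

class TolerantMetadataDemo {
    public static void main(String[] args) {
        TolerantMetadata headers = new TolerantMetadata(new SimpleMetadata());
        headers.add("Content-Language", "en");
        headers.add("content_language", "fi");
        System.out.println(headers.get("CONTENT-LANGUAGE")); // prints [en, fi]
    }
}
```

Code that knows the exact key names would use SimpleMetadata directly and skip the normalization cost; only the HTTP header handling would wrap it. Note that in this sketch the decorator stores the normalized spelling, so the original spelling of a key is lost.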

I played around with this earlier by implementing something you could 
call external metadata. Concretely, I created a separate sequence file 
(in this particular case of DMOZ data), keyed it by URL, and used that as 
a sort of metadata during the indexing phase (mapped together with the 
rest of the Nutch data).

The benefit of this approach compared to the current one (static 
metadata inside the crawldb) is that I can manage the metadata completely 
separately from crawldb operations, and crawldb operations run faster 
because there is less data to move around.
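
As a rough illustration of the external metadata idea, the sketch below merges per-URL external fields into a document's own fields at indexing time. It is only a stand-in: an in-memory Map keyed by URL plays the role of the separate sequence file, and the class name ExternalMetadataJoin, the method name, the field names, and the rule that the document's own fields win on a key clash are all assumptions made for the example, not how the real job was written.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

class ExternalMetadataJoin {

    // Builds the field map for one URL: external fields first, then the
    // document's own fields, so the document's values win on a key clash.
    // Neither input map is modified.
    static Map<String, String> fieldsForIndexing(
            String url,
            Map<String, String> crawlFields,
            Map<String, Map<String, String>> externalByUrl) {
        Map<String, String> merged = new TreeMap<String, String>();
        Map<String, String> external = externalByUrl.get(url);
        if (external != null) {
            merged.putAll(external);
        }
        merged.putAll(crawlFields);
        return merged;
    }

    public static void main(String[] args) {
        // Stand-in for the external sequence file, keyed by URL (made-up data).
        Map<String, String> external = new HashMap<String, String>();
        external.put("category", "example/category");
        Map<String, Map<String, String>> externalByUrl =
                new HashMap<String, Map<String, String>>();
        externalByUrl.put("http://example.com/", external);

        // Stand-in for the fields that come from the normal Nutch data.
        Map<String, String> crawlFields = new HashMap<String, String>();
        crawlFields.put("title", "Example page");

        System.out.println(
                fieldsForIndexing("http://example.com/", crawlFields, externalByUrl));
    }
}
```

The point of keeping the external data in its own keyed store is visible even in this toy version: the external map can be rebuilt or replaced without touching the crawl-side fields at all.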

--
  Sami Siren


