HUYLEBROECK Jeremy RD-ILAB-SSF wrote:
> If I am not wrong, segments generated by Generator are some sort of
> CrawlDatum.
> I am putting metadata in the CrawlDb (I keep information that never
> change) and I think they are copied to the segments by the Generator.
>
> But now I want to access those metadata at the Parsing or Indexing step
> to put some of them in the ParseData that were extracted (or directly in
> the index).
>
> I can't find a way to reassociate the "Content" and the Parse Object to
> their respective CrawlDb/Segment.
>
> Basically, I am trying to use CrawlDb as a database of metadata for
> every URL and want to use them at the indexing step to enrich the
> ParseData and then be able to search against them later on.
>
> Stupid Example: I know this URL is associated with the color "blue", but
> this information is not in the page pointed to by this URL. Blue
> would be kept in the metadata of the CrawlDb, then the
> generator/fetch/parse steps are done as usual, but when indexing, blue
> should be reassociated to the parsedata that has been extracted from the
> page. 
>
> Is it feasible without changing anything in nutch? (I use nutch as a
> library more or less and avoid changing stuff in it, I prefer redoing my
> own injector/generator/fetcher/parser and formats etc... if needed).
>
> I am going through all the different classes in nutch/hadoop now to
> understand where stuff are and if they are read and in what kind of
> object they are put.
> Any pointer to shorten my reading is very welcome ;)
>
> Thanks!
>
>
>   
hi,

The CrawlDatum keeps crawl status information about every URL that is
fetched. The class has a metadata field, an instance of MapWritable,
which behaves much like a HashMap. I have used the metadata field for
similar purposes. For example, in the fetcher you can set a property
like this:

datum.getMetaData().put(<key>, <value>);

and then in the indexing plugin you can retrieve it with:

datum.getMetaData().get(<key>);
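
To make the pattern concrete, here is a rough, self-contained sketch. It does not use the Nutch classes: DatumSketch is a hypothetical stand-in for CrawlDatum, a plain HashMap stands in for the MapWritable, and the "color"/"blue" pair is borrowed from your example. In Nutch itself the keys and values would be Writable objects rather than Strings, so treat this only as an illustration of the put-early, get-later idea.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for CrawlDatum: only the metadata map is modelled.
class DatumSketch {
    private final Map<String, String> metaData = new HashMap<>();

    Map<String, String> getMetaData() {
        return metaData;
    }
}

class MetadataSketch {
    // Earlier step (e.g. the fetcher): attach a value to the datum.
    static void tag(DatumSketch datum, String key, String value) {
        datum.getMetaData().put(key, value);
    }

    // Later step (e.g. an indexing plugin): read the value back, null if absent.
    static String lookup(DatumSketch datum, String key) {
        return datum.getMetaData().get(key);
    }

    public static void main(String[] args) {
        DatumSketch datum = new DatumSketch();
        tag(datum, "color", "blue");
        System.out.println(lookup(datum, "color"));
    }
}
```

The lookup returns null for a key that was never set, so the indexing side should check for that before adding a field.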





