HUYLEBROECK Jeremy RD-ILAB-SSF wrote:
> If I am not wrong, segments generated by Generator are some sort of
> CrawlDatum.
> I am putting metadata in the CrawlDb (I keep information that never
> change) and I think they are copied to the segments by the Generator.
>
> But now I want to access those metadata at the Parsing or Indexing step
> to put some of them in the ParseData that were extracted (or directly in
> the index).
>
> I can't find a way to reassociate the "Content" and the Parse Object to
> their respective CrawlDb/Segment.
>
> Basically, I am trying to use CrawlDb as a database of metadata for
> every URL and want to use them at the indexing step to enrich the
> ParseData and then be able to search against them later on.
>
> Stupid Example: I know this URL is associated to color "blue", but
> doesn't have this information in the page pointed by this URL. Blue
> would be kept in the metadata of the CrawlDb, then the
> generator/fetch/parse steps are done as usual, but when indexing, blue
> should be reassociated to the parsedata that has been extracted from the
> page.
>
> Is it feasible without changing anything in nutch? (I use nutch as a
> library more or less and avoid changing stuff in it, I prefer redoing my
> own injector/generator/fetcher/parser and formats etc... if needed).
>
> I am going through all the different classes in nutch/hadoop now to
> understand where stuff are and if they are read and in what kind of
> object they are put.
> Any pointer to shorten my reading is very welcome ;)
>
> Thanks!
>

Hi,
The CrawlDatum keeps crawl status information about every URL that is
fetched. The class has a metadata field, an instance of MapWritable,
which behaves much like a HashMap. I have used this metadata field for
similar purposes. For example, in the fetcher you can set a property
like this:

    datum.getMetaData().put(<key>, <value>);

and then, in the indexing plugin, you can retrieve it with:

    datum.getMetaData().get(<key>);

_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
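
The put/get round trip described in the reply can be sketched as follows. This is only an illustration under stated assumptions, not Nutch code: DatumStandIn is a hypothetical stand-in for CrawlDatum, a plain HashMap of Strings stands in for MapWritable (which, as far as I know, expects Writable keys and values such as Text), and the helper names tagAtFetchTime and readAtIndexTime are made up for the example. It uses the "blue" example from the question.

```java
import java.util.HashMap;
import java.util.Map;

public class CrawlMetadataSketch {

    /**
     * Hypothetical stand-in for Nutch's CrawlDatum. Only the metadata
     * accessor is modelled; a HashMap replaces MapWritable here.
     */
    static class DatumStandIn {
        private final Map<String, String> metaData = new HashMap<>();

        Map<String, String> getMetaData() {
            return metaData;
        }
    }

    /** Early step (e.g. the fetcher): attach a value to the datum. */
    static void tagAtFetchTime(DatumStandIn datum, String key, String value) {
        datum.getMetaData().put(key, value);
    }

    /** Later step (e.g. an indexing plugin): read the value back, or null if absent. */
    static String readAtIndexTime(DatumStandIn datum, String key) {
        return datum.getMetaData().get(key);
    }

    public static void main(String[] args) {
        DatumStandIn datum = new DatumStandIn();

        // The URL is known to be "blue" even though the page does not say so.
        tagAtFetchTime(datum, "color", "blue");

        // At indexing time the value travels with the datum.
        System.out.println(readAtIndexTime(datum, "color")); // prints: blue
    }
}
```

With the real classes, the same two calls would go through datum.getMetaData() directly, and a key that was never set comes back as null, so the indexing side should check for that before adding a field.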
