You are absolutely right! I was just throwing out ideas :) If you want to work with local data, org.apache.nutch.segment.SegmentReader may be helpful, I guess, since the fetched and parsed content is stored in the segments it reads.
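
To make the "work from local data" idea a bit more concrete, here is a rough sketch in plain Java. It is not Nutch code: LocalContentDeduplicator and LocalDedupDemo are hypothetical names, and the map of URL to content only stands in for records you would first read out of a segment (for example with SegmentReader). The filter method only mirrors the URLFilter convention of returning the URL to keep it or null to reject it. It hashes content with MD5, so it catches exact duplicates only, which is a simplifying assumption.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical helper (not part of Nutch): flags exact-duplicate pages by
 * hashing content that is already available locally, so nothing is re-fetched.
 */
class LocalContentDeduplicator {
    // Maps content digest -> first URL seen with that content.
    private final Map<String, String> firstUrlByDigest = new HashMap<>();

    /** Returns the hex-encoded MD5 digest of the raw bytes. */
    static String digest(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(content)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /**
     * Mirrors the URLFilter convention: returns the URL to keep it,
     * or null to reject it as a duplicate of an earlier URL.
     */
    String filter(String url, byte[] content) {
        String previous = firstUrlByDigest.putIfAbsent(digest(content), url);
        return (previous == null || previous.equals(url)) ? url : null;
    }
}

class LocalDedupDemo {
    public static void main(String[] args) {
        // Stand-in for (url, content) records read from a local segment.
        Map<String, String> fetched = new LinkedHashMap<>();
        fetched.put("http://example.com/a", "<html>same page</html>");
        fetched.put("http://example.com/b", "<html>same page</html>");
        fetched.put("http://example.com/c", "<html>other page</html>");

        LocalContentDeduplicator dedup = new LocalContentDeduplicator();
        for (Map.Entry<String, String> e : fetched.entrySet()) {
            byte[] bytes = e.getValue().getBytes(StandardCharsets.UTF_8);
            String kept = dedup.filter(e.getKey(), bytes);
            System.out.println(e.getKey() + " -> " + (kept == null ? "duplicate" : "kept"));
        }
    }
}
```

Near-duplicate detection would need a different signature than a plain hash, but the shape stays the same: look up the locally stored data by URL, compute a signature, and compare it against the ones already seen.
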
On Sun, Feb 22, 2015 at 4:33 PM, Renxia Wang <[email protected]> wrote:

> Thank you for your suggestion. I will take a look at that. There is a
> URLUtil class in Nutch's source code, but I wonder whether it will
> send a request to the URL again to get the data. Since the URL's metadata
> has already been downloaded, it would be better if we could get the data locally.
>
> On Sunday, February 22, 2015, Jiaxin Ye <[email protected]> wrote:
>
>> Hey,
>>
>> I haven't started working on the deduplication yet, but if I were you I
>> would use the Tika library to retrieve the MIME type and metadata. The code is
>> presented in the Tika book. Why not try that out? :)
>>
>> Best,
>> Jiaxin
>>
>> On Sunday, February 22, 2015, Renxia Wang <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I want to develop a URLFilter which takes a URL, gets its metadata or
>>> even the fetched content, and then uses a duplicate detection algorithm to
>>> determine whether it is a duplicate of any URL in the batch. However, the only
>>> parameter passed into the URLFilter is the URL. Is it possible to get the
>>> data I want for that input URL inside the URLFilter?
>>>
>>> Thanks,
>>>
>>> Zhique

