I have just started looking through that code and found that the URLFilter interface has a method named "filter". I think this is our point of interest. Maybe you should look at how to use this method in your plugin.
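To make that a bit more concrete, here is a rough sketch of what such a filter could look like. It is only an illustration under assumptions, not the way Nutch itself does deduplication. The `URLFilter` interface below is a local stand-in that keeps only the `filter` method (the real one lives in `org.apache.nutch.net` and comes with plugin and configuration plumbing that is left out here). `DuplicateUrlFilter` and the `signatureByUrl` map are hypothetical names: the map stands for whatever content signature you manage to read from the local crawl data.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for org.apache.nutch.net.URLFilter, reduced to its filtering
// method so the sketch compiles without Nutch on the classpath.
interface URLFilter {
    // Returns the URL to keep it, or null to reject it.
    String filter(String urlString);
}

// Hypothetical filter: rejects a URL when its locally stored content
// signature was first seen under a different URL.
class DuplicateUrlFilter implements URLFilter {
    // Assumption: filled beforehand from local crawl data (URL -> signature).
    private final Map<String, String> signatureByUrl;
    // Remembers which URL first claimed each signature.
    private final Map<String, String> firstUrlBySignature = new HashMap<>();

    DuplicateUrlFilter(Map<String, String> signatureByUrl) {
        this.signatureByUrl = signatureByUrl;
    }

    @Override
    public String filter(String urlString) {
        String signature = signatureByUrl.get(urlString);
        if (signature == null) {
            return urlString; // nothing known locally, so let the URL through
        }
        // putIfAbsent returns null when this signature had no owner yet.
        String firstUrl = firstUrlBySignature.putIfAbsent(signature, urlString);
        boolean keep = firstUrl == null || firstUrl.equals(urlString);
        return keep ? urlString : null;
    }
}

class UrlFilterSketch {
    public static void main(String[] args) {
        URLFilter filter = new DuplicateUrlFilter(Map.of(
                "http://example.com/a", "sig-1",
                "http://example.com/b", "sig-1"));
        System.out.println(filter.filter("http://example.com/a")); // kept
        System.out.println(filter.filter("http://example.com/b")); // null, same signature as /a
    }
}
```

The point of the sketch is that `filter` only ever sees the URL string, so anything about the content has to be looked up from data that is already stored locally, keyed by that URL.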
On Sun, Feb 22, 2015 at 4:41 PM, Jiaxin Ye <[email protected]> wrote:

> You are absolutely right! I am just throwing ideas out :) If you are looking
> at local data, org.apache.nutch.segment.SegmentReader may be helpful, I
> guess, as all the parsed data contents are located there.
>
> On Sun, Feb 22, 2015 at 4:33 PM, Renxia Wang <[email protected]> wrote:
>
>> Thank you for your suggestion. I will take a look at that. There is a
>> URLUtil class in Nutch's source code, but I am just wondering if that one
>> will send a request to the URL again to get the data. Because the URL's
>> metadata has already been downloaded, it is better if we can get the data
>> locally.
>>
>>
>> On Sunday, February 22, 2015, Jiaxin Ye <[email protected]> wrote:
>>
>>> Hey,
>>>
>>> I haven't started working on the deduplication yet, but if I were you I
>>> would use the Tika library to retrieve the MIME type and metadata. The
>>> code is presented in the Tika book. Why not try that out? :)
>>>
>>> Best,
>>> Jiaxin
>>>
>>> On Sunday, February 22, 2015, Renxia Wang <[email protected]> wrote:
>>>
>>>> Hi
>>>>
>>>> I want to develop a URLFilter which takes a URL, takes its metadata
>>>> or even the fetched content, then uses some duplicate detection
>>>> algorithms to determine if it is a duplicate of any URL in Nutch.
>>>> However, the only parameter passed into the URLFilter is the URL. Is it
>>>> possible to get the data I want for that input URL in the URLFilter?
>>>>
>>>> Thanks,
>>>>
>>>> Zhique
>>>>
>>>
>

