Yes, I tried that out, but that one takes only a URL as input. The problem is how to get the data for that URL locally.
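To make the question concrete, here is a minimal sketch of the kind of filter being discussed, not a working Nutch plugin. It assumes the already fetched content can be looked up locally by URL; the `SimpleUrlFilter` interface, the `ContentDedupFilter` class, and the in-memory `Map` used as the local store are all hypothetical stand-ins, and exact SHA-256 matching is just a placeholder for whatever duplicate detection algorithm ends up being used. The only thing borrowed from Nutch is the shape of `filter`: take a URL string, return it to keep it or `null` to reject it.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

public class DedupFilterSketch {

    /** Hypothetical stand-in with the same shape as a URL filter: url in, url or null out. */
    interface SimpleUrlFilter {
        String filter(String url);
    }

    /** Rejects a URL whose locally stored content matches content already seen under another URL. */
    static class ContentDedupFilter implements SimpleUrlFilter {
        // Assumption: url -> content that was already fetched and is available locally.
        private final Map<String, String> localContent;
        // Digest of content -> first URL seen with that content.
        private final Map<String, String> seen = new HashMap<>();

        ContentDedupFilter(Map<String, String> localContent) {
            this.localContent = localContent;
        }

        @Override
        public String filter(String url) {
            String content = localContent.get(url);
            if (content == null) {
                return url; // no local data for this URL, so it cannot be judged; keep it
            }
            String digest = sha256Hex(content);
            String firstUrl = seen.putIfAbsent(digest, url);
            // Keep the URL if it is the first one with this content, otherwise reject it.
            return (firstUrl == null || firstUrl.equals(url)) ? url : null;
        }
    }

    /** Hex-encoded SHA-256 of a string, used here as an exact-duplicate fingerprint. */
    static String sha256Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] hash = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> store = new HashMap<>();
        store.put("http://example.com/a", "same page body");
        store.put("http://example.com/b", "same page body");
        ContentDedupFilter filter = new ContentDedupFilter(store);
        System.out.println(filter.filter("http://example.com/a"));
        System.out.println(filter.filter("http://example.com/b"));
    }
}
```

The open part is exactly the one raised above: in a real plugin the `Map` would have to be replaced by a lookup into data Nutch has already stored, and the sketch does not show how to do that.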

On Sunday, February 22, 2015, Nagarjun Pola <[email protected]> wrote:

> I have just started looking along those lines and found that the interface
> URLFilter has a method named "filter". I think this is our point of
> interest. Maybe you should look at how to use this method in your plugin.
>
> On Sun, Feb 22, 2015 at 4:41 PM, Jiaxin Ye <[email protected]> wrote:
>
>> You are absolutely right! I am just throwing out ideas :) If you are
>> looking at local data, org.apache.nutch.segment.SegmentReader may be
>> helpful, I guess, since all the parsed content is stored there.
>>
>> On Sun, Feb 22, 2015 at 4:33 PM, Renxia Wang <[email protected]> wrote:
>>
>>> Thank you for your suggestion. I will take a look at that. There is a
>>> URLUtil class in Nutch's source code, but I wonder whether it will send
>>> a request to the URL again to get the data. Because the URL's metadata
>>> has already been downloaded, it would be better to get the data locally.
>>>
>>> On Sunday, February 22, 2015, Jiaxin Ye <[email protected]> wrote:
>>>
>>>> Hey,
>>>>
>>>> I haven't started working on the deduplication yet, but if I were you
>>>> I would use the Tika library to retrieve the MIME type and metadata.
>>>> The code is presented in the Tika book. Why not try that out? :)
>>>>
>>>> Best,
>>>> Jiaxin
>>>>
>>>> On Sunday, February 22, 2015, Renxia Wang <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I want to develop a URLFilter which takes a URL, gets its metadata or
>>>>> even the fetched content, and then uses some duplicate detection
>>>>> algorithm to determine whether it is a duplicate of any URL in Nutch.
>>>>> However, the only parameter passed into the URLFilter is the URL. Is
>>>>> it possible to get the data I want for that input URL in a URLFilter?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Zhique

