Yes, I tried that out, but that one takes only the URL as input. The problem
is how to get the data for that URL locally.

On Sunday, February 22, 2015, Nagarjun Pola <[email protected]> wrote:

> I have just started looking along those lines and found that the URLFilter
> interface has a method named "filter". I think this is our point of
> interest.
> Maybe you should look at how to use this method in your plugin.
>
> On Sun, Feb 22, 2015 at 4:41 PM, Jiaxin Ye <[email protected]> wrote:
>
>> You are absolutely right! I am just throwing out ideas :) If you are
>> looking for local data, org.apache.nutch.segment.SegmentReader may be
>> helpful, I guess, as all the parsed content is located there.
>>
>> On Sun, Feb 22, 2015 at 4:33 PM, Renxia Wang <[email protected]> wrote:
>>
>>> Thank you for your suggestion. I will take a look at that. There is a
>>> URLUtil class in Nutch's source code, but I wonder whether it will send
>>> a request to the URL again to get the data. Since the URL's metadata has
>>> already been downloaded, it would be better to get the data locally.
>>>
>>>
>>> On Sunday, February 22, 2015, Jiaxin Ye <[email protected]> wrote:
>>>
>>>> Hey,
>>>>
>>>> I haven't started working on the deduplication yet, but if I were you I
>>>> would use the Tika library to retrieve the MIME type and metadata. The
>>>> code is presented in the Tika book. Why not try that out? :)
>>>>
>>>> Best,
>>>> Jiaxin
>>>>
>>>> On Sunday, February 22, 2015, Renxia Wang <[email protected]> wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> I want to develop a URLFilter which takes a URL, takes its metadata
>>>>> or even the fetched content, then uses some duplicate detection
>>>>> algorithm to determine whether it is a duplicate of any URL in the
>>>>> batch. However, the only parameter passed into the URLFilter is the
>>>>> URL. Is it possible to get the data I want for that input URL in the
>>>>> URLFilter?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Zhique
>>>>>
>>>>
>>
>
