Re: How to read metadata/content of an URL in URLFilter?

Nagarjun Pola Sun, 22 Feb 2015 16:50:09 -0800

I have just started looking along those lines and found that the URLFilter
interface has a method named "filter". I think this is our point of
interest.
Maybe you should look at how to use this method in your plugin.
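
To make that concrete, here is a rough sketch of the shape such a filter
could take. It is not real Nutch code: UrlFilterLike is a hypothetical
stand-in for org.apache.nutch.net.URLFilter, reduced to the one method
above (return the URL to keep it, null to drop it), and the URL-to-digest
map is an assumption standing in for whatever locally stored crawl data
the plugin can load. Since filter only receives the URL string, the idea
is to look everything else up by that key.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for org.apache.nutch.net.URLFilter, reduced to the
// single method discussed here: return the URL to keep it, or null to drop it.
interface UrlFilterLike {
    String filter(String urlString);
}

// Sketch of a filter that rejects a URL when the locally stored content for
// it has the same digest as a URL that was accepted earlier.
class DigestDedupFilter implements UrlFilterLike {
    // Assumption: URL -> content digest, loaded beforehand from local data.
    private final Map<String, String> digestByUrl;
    private final Set<String> seenDigests = new HashSet<>();

    DigestDedupFilter(Map<String, String> digestByUrl) {
        this.digestByUrl = digestByUrl;
    }

    @Override
    public String filter(String urlString) {
        String digest = digestByUrl.get(urlString);
        if (digest == null) {
            return urlString; // nothing stored locally yet, so let it through
        }
        // Set.add returns false when the digest was already present.
        return seenDigests.add(digest) ? urlString : null;
    }
}

class DigestDedupDemo {
    public static void main(String[] args) {
        Map<String, String> digests = new HashMap<>();
        digests.put("http://example.com/a", "d1");
        digests.put("http://example.com/a?ref=x", "d1"); // same content
        DigestDedupFilter f = new DigestDedupFilter(digests);
        System.out.println(f.filter("http://example.com/a"));
        System.out.println(f.filter("http://example.com/a?ref=x"));
    }
}
```

The in-memory set of seen digests is only for illustration. In a real
crawl the filters may run in several tasks, so the state would probably
have to live somewhere shared.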





On Sun, Feb 22, 2015 at 4:41 PM, Jiaxin Ye <[email protected]> wrote:

> You are absolutely right! I am just throwing out ideas :) If you are looking
> at local data, org.apache.nutch.segment.SegmentReader may be helpful, I
> guess, since all the parsed content is stored in the segments it reads.
>
> On Sun, Feb 22, 2015 at 4:33 PM, Renxia Wang <[email protected]> wrote:
>
>> Thank you for your suggestion. I will take a look at that. There is a
>> URLUtil class in Nutch's source code, but I am wondering whether it will
>> send a request to the URL again to get the data. Since the URL's metadata
>> has already been downloaded, it would be better to get the data locally.
>>
>>
>> On Sunday, February 22, 2015, Jiaxin Ye <[email protected]> wrote:
>>
>>> Hey,
>>>
>>> I haven't started working on the deduplication yet, but if I were you I
>>> would use the Tika library to retrieve the MIME type and metadata. The
>>> code is presented in the Tika book. Why not try that out? :)
>>>
>>> Best,
>>> Jiaxin
>>>
>>> On Sunday, February 22, 2015, Renxia Wang <[email protected]> wrote:
>>>
>>>> Hi
>>>>
>>>> I want to develop a URLFilter which takes a URL, reads its metadata
>>>> or even the fetched content, then uses some duplicate detection algorithm
>>>> to determine whether it is a duplicate of any URL in the batch. However,
>>>> the only parameter passed into the URLFilter is the URL. Is it possible
>>>> to get the data I want for that input URL inside the URLFilter?
>>>>
>>>> Thanks,
>>>>
>>>> Zhique
>>>>
>>>
>
