You are absolutely right! I am just throwing ideas out :) If you want to work
with local data, org.apache.nutch.segment.SegmentReader may be helpful, I guess,
since all the parsed content is stored in the segments it reads.
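
To make the idea a bit more concrete, here is a rough sketch in plain Java. It is not Nutch code. As far as I know, a Nutch URLFilter only receives the URL string and returns either the URL (keep) or null (reject), so the class below mimics that contract. Everything else is an assumption for illustration: the class name LocalDedupFilter is made up, the in-memory map stands in for content already fetched into the segments, and a SHA-256 digest of the raw bytes is only a placeholder for whichever duplicate detection algorithm you choose (it catches exact duplicates only).

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: rejects a URL whose locally stored content was already seen. */
public class LocalDedupFilter {
    // Assumption: stand-in for content already fetched into the segments (URL -> raw bytes).
    private final Map<String, byte[]> localContent;
    // Hex digests of the content accepted so far.
    private final Set<String> seenDigests = new HashSet<>();

    public LocalDedupFilter(Map<String, byte[]> localContent) {
        this.localContent = localContent;
    }

    /** Returns the URL to keep it, or null to reject it as a duplicate. */
    public String filter(String url) {
        byte[] content = localContent.get(url);
        if (content == null) {
            return url; // nothing stored locally, so there is nothing to compare
        }
        // Set.add returns false when the digest is already present.
        return seenDigests.add(sha256Hex(content)) ? url : null;
    }

    /** Hex-encoded SHA-256 digest of the given bytes. */
    static String sha256Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 should be available on every Java platform", e);
        }
    }

    /** Convenience helper for building test content from text. */
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }
}
```

The point is only that the lookup is local: no second request is sent to the URL. In a real plugin the map would have to be replaced by whatever gives you access to the stored segment data.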

On Sun, Feb 22, 2015 at 4:33 PM, Renxia Wang <[email protected]> wrote:

> Thank you for your suggestion. I will take a look at that. There is a
> URLUtil class in Nutch's source code, but I am just wondering whether it will
> send a request to the URL again to get the data. Because the URL's metadata
> has already been downloaded, it would be better to get the data locally.
>
>
> On Sunday, February 22, 2015, Jiaxin Ye <[email protected]> wrote:
>
>> Hey,
>>
>> I haven't started working on the deduplication yet, but if I were you I
>> would use the Tika library to retrieve the MIME type and metadata. The code
>> is presented in the Tika book. Why not try that out? :)
>>
>> Best,
>> Jiaxin
>>
>> On Sunday, February 22, 2015, Renxia Wang <[email protected]> wrote:
>>
>>> Hi
>>>
>>> I want to develop a URLFilter which takes a URL, looks at its metadata or
>>> even the fetched content, and then uses a duplicate detection algorithm to
>>> determine whether it is a duplicate of any URL in the batch. However, the
>>> only parameter passed into the URLFilter is the URL. Is it possible to get
>>> the data I want for that input URL inside the URLFilter?
>>>
>>> Thanks,
>>>
>>> Zhique
>>>
>>
