Hey,

I haven't started working on the deduplication yet, but if I were you I
would use the Tika library to retrieve the MIME type and metadata. The
code is presented in the Tika book. Why not try that out? :)
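
For the duplicate check itself, once the fetched content is available as
bytes, a minimal sketch might look like the one below. It only catches
exact duplicates, by comparing SHA-256 digests of the content. The class
name ContentDeduplicator and the method isDuplicate are hypothetical and
are not part of Tika or of any URL filter API. Choosing SHA-256 is also an
assumption; a near-duplicate algorithm would need a different fingerprint.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: remembers a digest of every content blob seen so far.
class ContentDeduplicator {
    // Base64-encoded SHA-256 digests of all content seen in this batch.
    private final Set<String> seenDigests = new HashSet<>();

    // Returns true if identical content was passed in before, false otherwise.
    boolean isDuplicate(byte[] content) {
        try {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            String key = Base64.getEncoder().encodeToString(sha256.digest(content));
            // Set.add returns false when the key is already present.
            return !seenDigests.add(key);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on every Java platform.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    public static void main(String[] args) {
        ContentDeduplicator dedup = new ContentDeduplicator();
        byte[] pageA = "<html>same page</html>".getBytes(StandardCharsets.UTF_8);
        byte[] pageB = "<html>other page</html>".getBytes(StandardCharsets.UTF_8);
        System.out.println(dedup.isDuplicate(pageA));         // false (first time seen)
        System.out.println(dedup.isDuplicate(pageA.clone())); // true  (same bytes again)
        System.out.println(dedup.isDuplicate(pageB));         // false (different content)
    }
}
```

The same idea should work on a string built from selected metadata fields
instead of the raw content, as long as the fields are put in a fixed order
before hashing.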

Best,
Jiaxin

On Sunday, February 22, 2015, Renxia Wang <[email protected]> wrote:

> Hi
>
> I want to develop a URLFilter which takes a URL, gets its metadata or
> even the fetched content, then uses a duplicate detection algorithm to
> determine whether it is a duplicate of any URL in the batch. However, the
> only parameter passed into the URLFilter is the URL. Is it possible to
> get the data I want for that input URL inside the URLFilter?
>
> Thanks,
>
> Zhique
>
