Hey, I haven't started working on the deduplication yet, but if I were you I would use the Tika library to retrieve the MIME type and metadata. The code is presented in the Tika book. Why not try that out? :)
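For the duplicate check itself, once you have the fetched bytes in hand, a very rough starting point could be to compare content digests. This is only a sketch using the plain JDK: the class and method names (`ContentDeduplicator`, `firstUrlWithSameContent`) are made up for illustration and are not part of Nutch or Tika, and it only catches byte-identical pages, not near-duplicates.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Nutch or Tika: remembers a digest of every
// page body it has seen and reports the URL that first produced the same bytes.
class ContentDeduplicator {
    // hex digest -> first URL seen with that content
    private final Map<String, String> seen = new HashMap<>();

    /** Returns the URL of an earlier page with identical content, or null if the content is new. */
    String firstUrlWithSameContent(String url, byte[] content) {
        String key = sha256Hex(content);
        // putIfAbsent returns the previously stored URL, or null when the key was absent
        return seen.putIfAbsent(key, url);
    }

    /** Hex-encoded SHA-256 digest of the given bytes. */
    static String sha256Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be available on every Java platform
            throw new IllegalStateException(e);
        }
    }
}
```

You would call `firstUrlWithSameContent(url, bytes)` for each fetched page and treat a non-null result as "duplicate of that URL". How to get the bytes to this point is still the open question, since the filter itself only receives the URL string.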
Best,
Jiaxin

On Sunday, February 22, 2015, Renxia Wang <[email protected]> wrote:

> Hi,
>
> I want to develop a URLFilter that takes a URL, gets its metadata or
> even the fetched content, and then uses some duplicate detection
> algorithm to determine whether it is a duplicate of any URL in the
> batch. However, the only parameter passed into the URLFilter is the
> URL. Is it possible to get the data I want for that input URL inside
> the URLFilter?
>
> Thanks,
>
> Zhique

