Hi Nancy,

Tika is what put the metadata into the parsed content in the file you are looking at. See the parse-tika plugin. You don't need to use Tika any further: the information it extracts is already in your crawled data.
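If it helps, here is a minimal sketch of pulling the already-parsed text back out of a merged dump file. It assumes the dump is plain text and that each record is laid out with `Recno::`, `URL::`, `ParseData::` and `ParseText::` section markers; please check that against your own file, since the layout may differ between versions. `split_records` is a hypothetical helper written for illustration, not part of Nutch or Tika.

```python
import re

# Assumed section markers in the dump; verify against your own file.
SECTION = re.compile(r"^(Recno|URL|CrawlDatum|Content|ParseData|ParseText)::\s*(.*)$")


def split_records(dump_text):
    """Split dump text into one dict per record, keyed by section name."""
    records = []
    current = None   # dict for the record being built
    section = None   # name of the section we are currently inside
    for line in dump_text.splitlines():
        m = SECTION.match(line)
        if m:
            name, rest = m.group(1), m.group(2)
            if name == "Recno":
                # A new record starts at every Recno:: marker.
                current = {}
                records.append(current)
            if current is None:
                continue  # ignore anything before the first record
            section = name
            current[section] = [rest] if rest else []
        elif current is not None and section is not None:
            # Continuation line: belongs to the most recent section.
            current[section].append(line)
    return [{k: "\n".join(v).strip() for k, v in r.items()} for r in records]


if __name__ == "__main__":
    sample = (
        "Recno:: 0\n"
        "URL:: http://example.com/\n"
        "\n"
        "ParseData::\n"
        "Title: Example\n"
        "\n"
        "ParseText::\n"
        "Hello world\n"
    )
    for rec in split_records(sample):
        print(rec["URL"], "->", rec.get("ParseText", ""))
```

The `ParseText` entry is the text that the parser already extracted, and `ParseData` holds the metadata. A page whose own text contains a line starting with one of the markers would confuse this simple splitter, so treat it as a starting point only.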
Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Nancy Sharma <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, February 25, 2015 at 7:52 PM
To: "[email protected]" <[email protected]>
Subject: tika to parse url data content

>Hello,
>
>I have crawled a webpage as a part of my assignment(CS572). I have the
>segment folder with the url metadata and data(parsed and otherwise).
>
>I have also merged all the segments, to dump into an output file.
>
>This dump file, when opened in a text editor contains some parsed content
>and some encoded content, like special characters that is actually data
>from that url.
>
>The problem is, I am not very clear how to use tika here? Please help
>
>Thanks
>Nancy

