Hi Nancy,

Tika is what put the metadata into the parsed content
in the file you are looking at. See the parse-tika
plugin. You don’t need to use Tika beyond the
information that is already in your crawled data.
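
If it helps, here is a rough Python sketch for pulling the already-parsed
sections out of a text dump. It assumes the dump uses section markers such
as "Recno::", "URL::", "ParseData::" and "ParseText::"; those marker names,
the sample text, and the helper split_records are assumptions for
illustration, so check them against your own file.

```python
import re

# Assumed section markers of the text dump; verify against your own file,
# since the exact format is not confirmed in this thread.
SECTION = re.compile(
    r"^(Recno|URL|CrawlDatum|Content|ParseData|ParseText)::[ \t]*(.*)$"
)


def split_records(dump_text):
    """Return one dict per record, mapping section name to its text."""
    records = []
    current = None   # dict for the record being read
    section = None   # name of the section being read
    for line in dump_text.splitlines():
        m = SECTION.match(line)
        if m:
            name, rest = m.group(1), m.group(2)
            if name == "Recno":
                # A new record starts at every Recno marker.
                current = {}
                records.append(current)
            if current is None:
                continue  # marker seen before the first record; skip it
            current[name] = [rest] if rest else []
            section = name
        elif current is not None and section is not None:
            # Any other line belongs to the section opened most recently.
            current[section].append(line)
    return [{k: "\n".join(v).strip() for k, v in r.items()} for r in records]


# Hypothetical sample, only to show the shape of the result.
sample = """Recno:: 0
URL:: http://example.com/

ParseData::
Title: Example
Parse Metadata: Content-Type=text/html

ParseText::
Example page body
"""

records = split_records(sample)
print(records[0]["URL"])
print(records[0]["ParseData"])
print(records[0]["ParseText"])
```

The ParseData and ParseText sections are the output the parser produced
during the crawl, so reading them this way needs no further call to Tika.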

Cheers,
Chris


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++






-----Original Message-----
From: Nancy Sharma <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, February 25, 2015 at 7:52 PM
To: "[email protected]" <[email protected]>
Subject: tika to parse url data content

>Hello,
>
>
>I have crawled a webpage as part of my assignment (CS572). I have the
>segment folder with the URL metadata and data (parsed and otherwise).
>
>
>I have also merged all the segments, to dump into an output file.
>
>
>This dump file, when opened in a text editor, contains some parsed content
>and some encoded content, like special characters, that is actually data
>from that URL.
>
>
>The problem is, I am not clear on how to use Tika here. Please help.
>
>
>Thanks
>Nancy
>
