[Tika Wiki] Update of "TikaJAXRS" by HaydenYoung

Apache Wiki Thu, 31 Oct 2013 11:37:07 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change 
notification.


The "TikaJAXRS" page has been changed by HaydenYoung:
https://wiki.apache.org/tika/TikaJAXRS?action=diff&rev1=18&rev2=19

Comment:
Got ahead of myself re: extracting useful information using curl as some 
metadata is missing.

  Text is stored in the {{{__TEXT__}}} file and metadata CSV in {{{__METADATA__}}}. Use the "Accept" header if you want TAR output.
  
  = Extracting A Document From A URL =
+ It is possible to use a remote file with TikaJAXRS by downloading it via its 
URL first then piping it to the appropriate service:
  
- It is possible to use a remote file with TikaJAXRS by downloading it via its 
URL first then piping it to the appropriate service:
  {{{
  $ curl -s "http://url/to/my.file" | curl -X PUT -T - http://localhost:9998/meta
  $ curl -s "http://url/to/my.file" | curl -X PUT -T - http://localhost:9998/tika
  }}}
+ The caveat with the above is that it fetches the entire file, so large files such as video can take some time to download. You may therefore wish to use curl to get preliminary information (content type, name and size) about the file before you proceed:
  
- The caveat with the above is that it fetches the entire file, so large files such as video can take some time to download. With services such as "meta", it may be faster to fetch only the remote file's headers first using cURL:
  {{{
  $ curl -I http://url/to/my.file
  }}}
+ If the file should be parsed (e.g. you only want information about MP3s, MP4s and PDFs), send it on to TikaJAXRS.
- If the file's content is suitable for extraction (e.g. the content type is PDF, a word processing document or some other text file), send it on to TikaJAXRS:
- {{{
- $ curl -s "http://url/to/my.file" | curl -X PUT -T - http://localhost:9998/tika
- }}}
- While the output of cURL's header information is not as cleanly formatted as 
TikaJAXRS's "meta" service, performance may outweigh this drawback.
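
The header check described in the page text (run curl -I first, then forward only the files you care about) could be scripted roughly as below. This is a minimal sketch, not part of the wiki page: the function names, the list of wanted content types (audio/mpeg, video/mp4, application/pdf) and the use of -L to follow redirects are all assumptions made for illustration, and a server may report a different or missing Content-Type for the same kind of file.

```shell
# Sketch: inspect a remote file's headers before sending it to TikaJAXRS.
# Function names and the wanted content types are illustrative assumptions.

# Read HTTP response headers on stdin and print the media type from the
# last Content-Type header (the last one applies when redirects occur).
content_type_of() {
  tr -d '\r' | awk -F': *' '
    tolower($1) == "content-type" { split($2, a, ";"); ct = a[1] }
    END { print tolower(ct) }'
}

# Succeed only for content types we want Tika to parse (assumed list).
should_parse() {
  case "$1" in
    audio/mpeg|video/mp4|application/pdf) return 0 ;;
    *) return 1 ;;
  esac
}

# Fetch headers only; if the type is wanted, download the file and
# PUT it to the "meta" service. The server address defaults to the
# http://localhost:9998 used elsewhere on this page.
tika_meta_if_wanted() {
  url=$1
  tika=${2:-http://localhost:9998}
  ctype=$(curl -sIL "$url" | content_type_of)
  if should_parse "$ctype"; then
    curl -sL "$url" | curl -s -X PUT -T - "$tika/meta"
  else
    echo "skipping $url (content type: ${ctype:-unknown})" >&2
    return 1
  fi
}
```

After loading these functions into a shell, a call such as `tika_meta_if_wanted "http://url/to/my.file"` would issue the HEAD request first and only download the file when the reported type is on the list.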
