[ 
https://issues.apache.org/jira/browse/TIKA-3226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas DiPiazza updated TIKA-3226:
------------------------------------
    Description: 
Let's say you call the following API to parse a file and get its metadata and 
body content:

{code}
/rmeta/text
{code}

To do this, the caller needs to send the file to the Tika server, then receive 
the metadata and body in the response. When you are working with 
microservices, this causes a lot of inter-service network communication.

You can cut down on most of this overhead by using the local file system 
optimization, where you send a file path instead of the entire file. But this 
only works when the caller and the Tika server are on the same machine.

Ideally, we would have a way to deploy "connector plugins" into Tika and be 
able to send files to be parsed with these plugins (asynchronously?).

{code}
/connector/{fetcherId}/{emitterId}
{code}
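
As a rough illustration of how a server could pull the two ids out of such a 
path, here is a minimal Java sketch. ConnectorRoute and its parse method are 
hypothetical names, not existing Tika classes, and nothing here reflects how 
tika-server actually routes requests.

```java
import java.util.Objects;

// Hypothetical holder for the two ids carried by /connector/{fetcherId}/{emitterId}.
final class ConnectorRoute {
    final String fetcherId;
    final String emitterId;

    private ConnectorRoute(String fetcherId, String emitterId) {
        this.fetcherId = fetcherId;
        this.emitterId = emitterId;
    }

    // Parses a request path of the form /connector/{fetcherId}/{emitterId}.
    static ConnectorRoute parse(String path) {
        Objects.requireNonNull(path, "path");
        String[] parts = path.split("/");
        // A leading slash yields an empty first element:
        // ["", "connector", fetcherId, emitterId]
        if (parts.length != 4
                || !parts[0].isEmpty()
                || !"connector".equals(parts[1])
                || parts[2].isEmpty()
                || parts[3].isEmpty()) {
            throw new IllegalArgumentException(
                    "Expected /connector/{fetcherId}/{emitterId} but got: " + path);
        }
        return new ConnectorRoute(parts[2], parts[3]);
    }
}
```

With made-up ids, ConnectorRoute.parse("/connector/http/solr") would give 
fetcherId "http" and emitterId "solr", which the server could then use to look 
up the deployed plugins.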

The Fetcher interface:

init(Map initParams)
  - initializes the fetcher (for example, sets up an HTTP connection pool)

void fetch(Map parseParams, Metadata metadata, OutputStream bodyOutputStream)
  - fetches the document indicated by parseParams and does whatever it is you 
want with it (for example, download a file from a web data source, then index 
the document into Solr). Writes the body to bodyOutputStream and populates the 
metadata object with the document's metadata.

The Emitter interface would be:

init(Map initParams)
  - initializes the emitter (for example, sets up a buffer for output 
documents bound for Solr, connects to Solr)

void emit(Map parseParams, Fetcher fetcher)
  - fetches and parses the "document" using the passed-in fetcher, then emits 
the result to its destination.
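
To make the two proposed interfaces concrete, here is a minimal Java sketch. 
It is only one possible reading of the proposal: the method names and 
parameter order follow the description above, but the generic types, the 
IOException clauses, the "id" and "prefix" parameter names, and the 
InMemoryFetcher and ListEmitter classes are all assumptions made for 
illustration. Metadata is a small stand-in for 
org.apache.tika.metadata.Metadata so the sketch compiles without Tika, and the 
parse step is reduced to decoding the bytes as UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-in for org.apache.tika.metadata.Metadata (assumption, keeps the sketch self-contained).
final class Metadata {
    private final Map<String, String> values = new HashMap<>();
    void set(String name, String value) { values.put(name, value); }
    String get(String name) { return values.get(name); }
}

// Method names follow the proposal; generics and IOException are assumptions.
interface Fetcher {
    void init(Map<String, String> initParams);
    void fetch(Map<String, String> parseParams, Metadata metadata, OutputStream bodyOutputStream)
            throws IOException;
}

interface Emitter {
    void init(Map<String, String> initParams);
    void emit(Map<String, String> parseParams, Fetcher fetcher) throws IOException;
}

// Hypothetical fetcher that serves documents from memory instead of HTTP or S3.
final class InMemoryFetcher implements Fetcher {
    private final Map<String, String> documents = new HashMap<>();

    @Override
    public void init(Map<String, String> initParams) {
        // For illustration, every init parameter is treated as "document id -> body".
        documents.putAll(initParams);
    }

    @Override
    public void fetch(Map<String, String> parseParams, Metadata metadata, OutputStream bodyOutputStream)
            throws IOException {
        String id = parseParams.get("id"); // "id" is an assumed parameter name
        String body = documents.get(id);
        if (body == null) {
            throw new IOException("No document with id: " + id);
        }
        byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
        metadata.set("Content-Length", Integer.toString(bytes.length));
        bodyOutputStream.write(bytes);
    }
}

// Hypothetical emitter that collects results in a list where a SolrEmitter would index them.
final class ListEmitter implements Emitter {
    final List<String> emitted = new ArrayList<>();
    private String prefix = "";

    @Override
    public void init(Map<String, String> initParams) {
        prefix = initParams.getOrDefault("prefix", ""); // "prefix" is an assumed parameter name
    }

    @Override
    public void emit(Map<String, String> parseParams, Fetcher fetcher) throws IOException {
        Metadata metadata = new Metadata();
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        fetcher.fetch(parseParams, metadata, body);
        // A real emitter would run the Tika parser here; this sketch only decodes the bytes.
        String text = new String(body.toByteArray(), StandardCharsets.UTF_8);
        emitted.add(prefix + text + " (" + metadata.get("Content-Length") + " bytes)");
    }
}
```

In this reading the emitter drives the whole round trip: the endpoint would 
look up both plugins by id and call emit once per request, so the document 
body never has to travel back to the original caller.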

We could provide the most common fetchers and emitters, such as:

HttpFetcher
S3Fetcher
SolrEmitter
...



  was:
Let's say you call the following API to parse a file and get its metadata and 
body content:

{code}
/rmeta/text
{code}

To do this, the caller needs to send the file to the Tika server, then receive 
the metadata and body in the response. When you are working with 
microservices, this causes a lot of inter-service network communication.

You can cut down on most of this overhead by using the local file system 
optimization, where you send a file path instead of the entire file. But this 
only works when the caller and the Tika server are on the same machine.

Ideally, we would have a way to deploy "connector plugins" into Tika and be 
able to send files to be parsed with these plugins asynchronously.

{code}
/connector/{fetcherId}/{emitterId}
{code}

The Fetcher interface:

init(Map initParams)
  - initializes the fetcher (for example, sets up an HTTP connection pool)

void fetch(Map parseParams, Metadata metadata, OutputStream bodyOutputStream)
  - fetches the document indicated by parseParams and does whatever it is you 
want with it (for example, download a file from a web data source, then index 
the document into Solr). Writes the body to bodyOutputStream and populates the 
metadata object with the document's metadata.

The Emitter interface would be:

init(Map initParams)
  - initializes the emitter (for example, sets up a buffer for output 
documents bound for Solr, connects to Solr)

void emit(Map parseParams, Fetcher fetcher)
  - fetches and parses the "document" using the passed-in fetcher, then emits 
the result to its destination.

We could provide the most common fetchers and emitters, such as:

HttpFetcher
S3Fetcher
SolrEmitter
...




> Add custom connector endpoint
> -----------------------------
>
>                 Key: TIKA-3226
>                 URL: https://issues.apache.org/jira/browse/TIKA-3226
>             Project: Tika
>          Issue Type: New Feature
>          Components: server
>            Reporter: Nicholas DiPiazza
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
