[ https://issues.apache.org/jira/browse/TIKA-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17489843#comment-17489843 ]
Tim Allison commented on TIKA-3523:
-----------------------------------
AFK now. Will take a look first thing tomorrow.
> A replacement for enableFileUrl or Support for Google Cloud
> -----------------------------------------------------------
>
> Key: TIKA-3523
> URL: https://issues.apache.org/jira/browse/TIKA-3523
> Project: Tika
> Issue Type: Wish
> Components: tika-server
> Affects Versions: 2.0.0
> Reporter: Fatih Pazarbasi
> Priority: Minor
>
> Hello,
> I have a setup where users upload their files to a cloud bucket, and I forward
> the fileUrl to a serverless cloud instance to run OCR on them. I do it this
> way so that users never contact the Tika Server directly and I keep a copy of
> what they've sent for processing. The users also have nothing to do with the
> unprocessed response.
> Now that enableFileUrl has been removed, I have to download the files from
> the cloud bucket to the backend instance and then upload them to the
> /tika endpoint of the Tika server again.
> I tried the following config.xml to work around the situation, but it was in
> vain.
> For the made-up URL:
> [https://firebasestorage.googleapis.com/v0/b/abcd-efgh.appspot.com/o/somefilethatdoesnotexist.pdf|https://firebasestorage.googleapis.com/v0/b/abcd-efgh.appspot.com/o/]
> {code:java}
> <fetchers>
>   <fetcher class="org.apache.tika.pipes.fetcher.fs.FileSystemFetcher">
>     <params>
>       <name>fsf</name>
>       <basePath>https://firebasestorage.googleapis.com/v0/b/abcd-efgh.appspot.com/o</basePath>
>     </params>
>   </fetcher>
> </fetchers>
> <emitters>
>   <emitter class="org.apache.tika.pipes.emitter.fs.FileSystemEmitter">
>     <params>
>       <name>fse</name>
>       <basePath>gs://abcd-efgh.appspot.com/users</basePath>
>     </params>
>   </emitter>
> </emitters>
> <server>
>   <params>
>     <enableUnsecureFeatures>true</enableUnsecureFeatures>
>   </params>
> </server>
> <pipes>
>   <params>
>     <tikaConfig>/path/to/tika-config.xml</tikaConfig>
>   </params>
> </pipes>{code}
> {code:java}
> headers: {
>   Accept: 'text/plain',
>   'User-Agent': 'Firebase Functions',
>   fetcherName: 'fsf',
>   fetchKey: 'somefilethatdoesnotexist.pdf',
> },{code}
> It doesn't support the gs:// Google Storage bucket either. I have all the
> necessary permissions, but that didn't help. I'm using a dockerized version of
> tika server, so the file system does not seem to be my concern.
>
> In the golden times of 1.2x I was simply using:
>
> {code:java}
> headers: {
>   Accept: 'text/plain',
>   'User-Agent': 'Firebase Functions',
>   fileUrl: 'https://firebasestorage.googleapis.com/v0/b/abcd-efgh.appspot.com/o/somefilethatdoesnotexist.pdf',
> },{code}
>
> Am I missing something? If not, my wish is that fetcherName becomes the
> fixed first part of the old fileUrl and fetchKey becomes the specific pointer
> to a file.
> This way I would have some control over the URLs that are sent to tika server,
> unlike with enableFileUrl, and I could have my cake and eat it too, without
> creating extra traffic on the backend by downloading from the bucket and
> uploading to tika.