[ 
https://issues.apache.org/jira/browse/TIKA-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fatih Pazarbasi updated TIKA-3523:
----------------------------------
    Description: 
Hello,

I have a setup where users upload their files to a cloud bucket and I forward 
the fileUrl to a serverless cloud instance that runs OCR on them. I do it this 
way so that users never contact the Tika Server directly and I keep a copy of 
what they sent for processing. Also, they never deal with the unprocessed 
response.

Now that you've removed enableFileUrl, I have to download the files from the 
cloud bucket to the backend instance and then PUT them to the /tika endpoint 
of the server again.
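
For illustration, the two-hop workaround described above could look roughly like the sketch below. The function name `extractViaBackend`, the injectable `fetchImpl` parameter, and the example URLs are assumptions made for this sketch; only the PUT to `/tika` with `Accept: text/plain` reflects the setup described in this issue. It assumes a runtime with a global `fetch` (for example Node.js 18 or later).

```javascript
// Sketch of the workaround: the backend downloads the file from the bucket
// and then uploads the bytes to Tika, so every file crosses the backend twice.
// `fetchImpl` is injectable so the function can be exercised without a network.
async function extractViaBackend(fileUrl, tikaBaseUrl, fetchImpl = fetch) {
  // Hop 1: bucket -> backend.
  const download = await fetchImpl(fileUrl);
  if (!download.ok) {
    throw new Error(`Download failed with status ${download.status}`);
  }
  const bytes = await download.arrayBuffer();

  // Hop 2: backend -> Tika server.
  const response = await fetchImpl(`${tikaBaseUrl}/tika`, {
    method: 'PUT',
    headers: { Accept: 'text/plain' },
    body: bytes,
  });
  if (!response.ok) {
    throw new Error(`Tika request failed with status ${response.status}`);
  }
  return response.text();
}
```

The second hop is the extra traffic that the old fileUrl header avoided, because the Tika server fetched the file itself.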

I tried the following config.xml to work around the situation, but in vain.
For the made-up URL: 
[https://firebasestorage.googleapis.com/v0/b/abcd-efgh.appspot.com/o/somefilethatdoesnotexist.pdf|https://firebasestorage.googleapis.com/v0/b/abcd-efgh.appspot.com/o/]
{code:xml}
<fetchers>
 <fetcher class="org.apache.tika.pipes.fetcher.fs.FileSystemFetcher">
  <params>
   <name>fsf</name>
   <basePath>https://firebasestorage.googleapis.com/v0/b/abcd-efgh.appspot.com/o</basePath>
  </params>
 </fetcher>
</fetchers>
<emitters>
 <emitter class="org.apache.tika.pipes.emitter.fs.FileSystemEmitter">
  <params>
   <name>fse</name>
   <basePath>gs://abcd-efgh.appspot.com/users</basePath>
  </params>
 </emitter>
</emitters>
<server>
 <params>
  <enableUnsecureFeatures>true</enableUnsecureFeatures>
 </params>
</server>
<pipes>
 <params>
  <tikaConfig>/path/to/tika-config.xml</tikaConfig>
 </params>
</pipes>
{code}
{code:javascript}
headers: {
  Accept: 'text/plain',
  'User-Agent': 'Firebase Functions',
  fetcherName: 'fsf',
  fetchKey: 'somefilethatdoesnotexist.pdf',
},
{code}
It doesn't support the gs:// Google Storage bucket either. I have all the 
necessary permissions but it didn't help.
  
Back in the golden days of 1.2x, I was simply using:
  
{code:javascript}
headers: {
  Accept: 'text/plain',
  'User-Agent': 'Firebase Functions',
  fileUrl: 'https://firebasestorage.googleapis.com/v0/b/abcd-efgh.appspot.com/o/somefilethatdoesnotexist.pdf',
},
{code}
 
  
Am I missing something? If not, my wish is this: could you please make it so 
that fetcherName identifies the fixed first part of the old fileUrl and 
fetchKey is the specific pointer to a file?

This way I keep some control over the URLs that are sent to the Tika server, 
unlike with enableFileUrl, and I can have my cake and eat it too: no extra 
traffic on the backend from downloading from the bucket and uploading to Tika.
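
To make the request concrete, here is a minimal sketch of the resolution being asked for: a fetcher name selects a fixed, pre-approved URL prefix, and the fetch key can only name an object under that prefix. The `fetcherBases` map and the `resolveFetchUrl` function are hypothetical names invented for this illustration; this is not how Tika's fetchers are actually implemented.

```javascript
// Hypothetical mapping of fetcher names to allowed URL prefixes.
// This only illustrates the wish above; it is not Tika's fetcher API.
const fetcherBases = new Map([
  ['fsf', 'https://firebasestorage.googleapis.com/v0/b/abcd-efgh.appspot.com/o/'],
]);

function resolveFetchUrl(fetcherName, fetchKey) {
  const base = fetcherBases.get(fetcherName);
  if (base === undefined) {
    throw new Error(`Unknown fetcher: ${fetcherName}`);
  }
  // Encode the key as a single path segment so that '/', '?' or '#'
  // inside it cannot change the path, query, or fragment of the URL.
  const resolved = new URL(encodeURIComponent(fetchKey), base);
  // Reject keys such as '', '.' or '..' that resolve to the base itself
  // or to a location outside of it.
  if (resolved.href === base || !resolved.href.startsWith(base)) {
    throw new Error(`Invalid fetchKey: ${fetchKey}`);
  }
  return resolved.href;
}
```

With such a scheme the caller sends only `fetcherName: 'fsf'` and `fetchKey: 'somefilethatdoesnotexist.pdf'`, and the server never fetches from a host that was not configured in advance.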


> A replacement for enableFileUrl or Support for Google Cloud
> -----------------------------------------------------------
>
>                 Key: TIKA-3523
>                 URL: https://issues.apache.org/jira/browse/TIKA-3523
>             Project: Tika
>          Issue Type: Wish
>          Components: tika-server
>    Affects Versions: 2.0.0
>            Reporter: Fatih Pazarbasi
>            Priority: Minor
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
