[ 
https://issues.apache.org/jira/browse/TIKA-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13900364#comment-13900364
 ] 

sudheshna iyer commented on TIKA-1239:
--------------------------------------

Thank you very much for your quick response.

My program will be called from a batch process that loops over a set of
files, so I need to make it thread-safe and fast.
1. Is my approach correct?

2.  When I declared BodyContentHandler in the bean context, the process
failed on the 3rd file. When I instead created
new BodyContentHandler()
inside my 3rd method, it started working, but then gave
the error below:

 Your document contained more than 100000 characters.

So I have passed resourcesize to BodyContentHandler. The problem is that I
don't know the size of each file ahead of time. Currently I have set
resourcesize = 10MB. I am not sure whether that size is too big, or whether
it will still be too small for some larger files.

3.  Repeating my 3rd approach here for clarity:

BodyContentHandler bodyContentHandler = new BodyContentHandler(resourcesize);
Metadata metadata = new Metadata();
parser.parse(TikaInputStream.get(stream), bodyContentHandler, metadata,
        parseContext);
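
To illustrate one possible reading of the failure in point 2, here is a
minimal sketch that does not use Tika at all. LimitedHandler is a
hypothetical stand-in, not a Tika class. It assumes that a size-limited
handler keeps a running character count that is never reset between
documents, which has not been verified for BodyContentHandler. The file
count, the 40,000 characters per file and the method names are made up
for the example; only the 100000 limit comes from the error message.

```java
import java.util.Arrays;

class HandlerReuseDemo {

    /** Hypothetical stand-in for a size-limited content handler; NOT a Tika class. */
    static final class LimitedHandler {
        private final int writeLimit;
        private int written = 0; // running total, never reset (assumption)

        LimitedHandler(int writeLimit) {
            this.writeLimit = writeLimit;
        }

        /** Accepts text until the running total would exceed the limit. */
        void characters(String chunk) {
            if (written + chunk.length() > writeLimit) {
                throw new IllegalStateException(
                        "Your document contained more than " + writeLimit + " characters.");
            }
            written += chunk.length();
        }
    }

    /** Builds a fake document body of the given length. */
    static String fakeBody(int length) {
        char[] chars = new char[length];
        Arrays.fill(chars, 'x');
        return new String(chars);
    }

    /** Feeds every file through ONE handler; returns how many files succeeded. */
    static int parsedWithSharedHandler(int fileCount, int charsPerFile, int writeLimit) {
        LimitedHandler shared = new LimitedHandler(writeLimit);
        int succeeded = 0;
        for (int i = 0; i < fileCount; i++) {
            try {
                shared.characters(fakeBody(charsPerFile));
                succeeded++;
            } catch (IllegalStateException e) {
                break; // the shared running total has hit the limit
            }
        }
        return succeeded;
    }

    /** Creates a NEW handler per file; returns how many files succeeded. */
    static int parsedWithFreshHandlers(int fileCount, int charsPerFile, int writeLimit) {
        int succeeded = 0;
        for (int i = 0; i < fileCount; i++) {
            LimitedHandler perFile = new LimitedHandler(writeLimit);
            try {
                perFile.characters(fakeBody(charsPerFile));
                succeeded++;
            } catch (IllegalStateException e) {
                // only this one file is over the limit; keep going
            }
        }
        return succeeded;
    }

    public static void main(String[] args) {
        // Three files of 40,000 characters each against a 100,000 limit.
        System.out.println(parsedWithSharedHandler(3, 40_000, 100_000)); // prints 2
        System.out.println(parsedWithFreshHandlers(3, 40_000, 100_000)); // prints 3
    }
}
```

Under that assumption the shared handler rejects the 3rd file even though
no single file is over the limit, while a handler created per call only
fails when one file alone is too large. A per-call handler also keeps
each thread's extracted text separate in a batch run.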





> Using Spring and Tika together. Need to extract the content and metadata. 
> --------------------------------------------------------------------------
>
>                 Key: TIKA-1239
>                 URL: https://issues.apache.org/jira/browse/TIKA-1239
>             Project: Tika
>          Issue Type: Task
>          Components: general, metadata, parser
>            Reporter: sudheshna iyer
>            Priority: Critical
>
> I need to use spring with Tika. Is it thread safe to use the following 
> injected from bean context. I am injecting parseContext, handler and parser 
> into my class TikaImpl. 
> ================
> <bean name="parseContext" class="org.apache.tika.parser.ParseContext"></bean>
>       <bean name="parser" 
> class="org.apache.tika.parser.AutoDetectParser"></bean>
>       <bean name="handler" class="org.xml.sax.helpers.DefaultHandler"></bean>
>       
>       <bean id="tikaService" class="com.intech.tika.TikaImpl">
>       <property name="parseContext" ref="parseContext"></property>
>       <property name="parser" ref="parser"></property>
>       <property name="handler" ref="handler"></property>
>       <property name="resourcesize"><value>10485760</value></property>
>     </bean>
> ===============
> In my class I have 3 methods 1. To retrieve metadata 2. to retrieve content 
> 3. to retrieve both.
> So for 1. Retrieve metadata, I am using: 
> parser.parse(stream, handler,
>                                       metadata, parseContext)
> 2. To retrieve the content, i am using: 
> Tika tika = new Tika();
> tika.setMaxStringLength(resourcesize);
> String content = tika.parseToString(stream);
> 3. To retrieve both: I am using: 
> BodyContentHandler bodyContentHandler = new BodyContentHandler(resourcesize);
> Metadata metadata = new Metadata();
> parser.parse(TikaInputStream.get(stream), bodyContentHandler, metadata, 
> parseContext);
> Question is: 
> Is my approach thread safe? Introduced 3 methods, thinking that just getting 
> metadata from the first method is faster than the 3rd method. 
> Need your suggestion badly. Thank you in advance.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
