Thanks Konstantin and Gabriele! Please feel free to email any other questions or open an issue on the Tika JIRA.
Have a good day!

Tyler

On Jan 29, 2015 11:43 AM, "Gabriele Guidi" <[email protected]> wrote:

> Ok, thank you for your support
>
> Best regards
>
> 2015-01-29 15:14 GMT+01:00 Konstantin Gribov <[email protected]>:
>
> > Hi, Gabriele.
> >
> > If you're using an InputStream which doesn't support mark/reset, the
> > Tika facade (org.apache.tika.Tika) creates a BufferedInputStream which
> > consumes up to 8k of the original InputStream by default, so the Tika
> > mime type detector can't find the pdf magic after the first call.
> >
> > The second case (with copying to byte[]) is similar. If you do this
> > copy before calling tika.detect, you consume that input stream, and
> > subsequent calls on that stream return application/octet-stream as the
> > default mime-type. But all works fine with bytes, since you have a full
> > copy of the original stream in it.
> >
> > If you call tika.detect on the input stream before copying it to bytes,
> > it falls into the first case: you'll copy the input stream without its
> > first 8k, so you drop the pdf magic.
> >
> > You have to recreate the input stream, copy it somewhere to a temporary
> > resource (as with bytes or some temp file), or wrap it in a
> > BufferedInputStream before passing it to tika.detect.
> >
> > --
> > Best regards,
> > Konstantin Gribov
> >
> > Thu Jan 29 2015 at 16:07:12, Gabriele Guidi <[email protected]>:
> >
> >> Hi
> >>
> >> No, I asked it with the markSupported() function
> >> <http://docs.oracle.com/javase/7/docs/api/java/io/InputStream.html#markSupported()>
> >> and it says "NO".
> >> No recreation.
> >>
> >> The code test is very simple:
> >>
> >> InputStream inputsbust = content.getContentStream();
> >> System.out.println(" mark and reset inputStream ? "+(inputsbust.markSupported()?"YES":"NO"));
> >> System.out.println(" 1 mime : " + tika.detect(inputsbust));
> >> System.out.println(" 2 mime : " + tika.detect(inputsbust));
> >> byte[] bytes = IOUtils.toByteArray(inputsbust);
> >> System.out.println(" 3 mime : " + tika.detect(bytes));
> >> System.out.println(" 3.2 mime : " + tika.detect(bytes));
> >>
> >> The result:
> >>
> >> mark and reset of inputStream ? NO
> >>
> >> 1 mime : application/pdf
> >> 2 mime : application/octet-stream
> >> 3 mime : application/octet-stream
> >> 3.2 mime : application/octet-stream
> >>
> >> If I put the 5th line ("byte[] bytes = IOUtils.toByteArray(inputsbust);")
> >> as the second line, the result is:
> >>
> >> mark and reset of inputStream ? NO
> >>
> >> 1 mime : application/octet-stream
> >> 2 mime : application/octet-stream
> >> 3 mime : application/pdf
> >> 3.2 mime : application/pdf
> >>
> >> I hope it helps
> >>
> >> Thanks
> >>
> >> 2015-01-29 10:49 GMT+01:00 Konstantin Gribov <[email protected]>:
> >>
> >>> Hi,
> >>>
> >>> Does this InputStream support mark/reset functionality? Is the
> >>> InputStream recreated before each subsequent call to tika.detect, or
> >>> is it called on a partially consumed stream (in case mark isn't
> >>> supported)?
> >>>
> >>> --
> >>> Best regards,
> >>> Konstantin Gribov
> >>>
> >>> Thu Jan 29 2015 at 9:25:28, Mattmann, Chris A (3980) <[email protected]>:
> >>>
> >>>> Dear Gabriele,
> >>>>
> >>>> Thanks for your question. It should be sent to [email protected]
> >>>> (moving [email protected] to BCC).
> >>>>
> >>>> I’ll take a look tomorrow if someone else hasn’t answered yet.
> >>>>
> >>>> Cheers,
> >>>> Chris
> >>>>
> >>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>> Chris Mattmann, Ph.D.
> >>>> Chief Architect
> >>>> Instrument Software and Science Data Systems Section (398)
> >>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> >>>> Office: 168-519, Mailstop: 168-527
> >>>> Email: [email protected]
> >>>> WWW: http://sunset.usc.edu/~mattmann/
> >>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>> Adjunct Associate Professor, Computer Science Department
> >>>> University of Southern California, Los Angeles, CA 90089 USA
> >>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>>
> >>>> -----Original Message-----
> >>>> From: Gabriele Guidi <[email protected]>
> >>>> Date: Wednesday, January 28, 2015 at 5:25 AM
> >>>> To: "[email protected]" <[email protected]>
> >>>> Subject: multiple detect call -> different results (tika 1.7)
> >>>>
> >>>> >Hi,
> >>>> >
> >>>> >I found a strange behavior. I have a p7m file, then I extract the
> >>>> >file inside the signed one; after that I use Tika to discover the
> >>>> >mime type. The first call gives me "application/pdf" (that's
> >>>> >correct). BUT every subsequent call to the detect method of Tika on
> >>>> >the same inputStream gives me "application/octet-stream". ...why?
> >>>> >I cannot understand the behavior ...and find a solution.
> >>>> >
> >>>> >Just a snippet of code:
> >>>> >
> >>>> >InputStream inputsbust = content.getContentStream();
> >>>> >
> >>>> >System.out.println(" 1 mime " + filepath + " : " + tika.detect(inputsbust));
> >>>> >System.out.println(" 2 mime " + filepath + " : " + tika.detect(inputsbust));
> >>>> >System.out.println(" 3 mime " + filepath + " : " + tika.detect(inputsbust));
> >>>> >
> >>>> >Result:
> >>>> >
> >>>> > 1 mime /home/gguidi/01_file.pdf : application/pdf
> >>>> > 2 mime /home/gguidi/01_file.pdf : application/octet-stream
> >>>> > 3 mime /home/gguidi/01_file.pdf : application/octet-stream
> >>>> >
> >>>> >Thanks
> >>>> >
> >>>> >--
> >>>> >Gabriele Guidi
> >>>> >Direzione Pubblica Amministrazione
> >>>> >[email protected]
> >>>> >
> >>>> >Engineering Ingegneria Informatica spa
> >>>> >Via Marconi, 10 - 40122, Bologna
> >>>> >Tel. +39-051.0435135
> >>>> >www.eng.it <http://www.eng.it>
> >>>> >
> >>>> >Rispetta l'ambiente. Non stampare questa e-mail se non necessario.
> >>>> >Respect the environment. Please don't print this e-mail unless you
> >>>> >really need to.
> >>>> >Le informazioni trasmesse sono destinate esclusivamente alla persona
> >>>> >o alla società in indirizzo e sono da intendersi confidenziali e
> >>>> >riservate. Ogni trasmissione, inoltro, diffusione o altro uso di
> >>>> >queste informazioni a persone o società differenti dal destinatario
> >>>> >è proibita. Se ricevete questa comunicazione per errore, contattate
> >>>> >il mittente e cancellate le informazioni da ogni computer.
> >>>> >The information transmitted is intended only for the person or
> >>>> >entity to which it is addressed and may contain confidential and/or
> >>>> >privileged material.
> >>>> >Any review, retransmission, dissemination or other use of, or taking
> >>>> >of any action in reliance upon, this information by persons or
> >>>> >entities other than the intended recipient is prohibited. If you
> >>>> >received this in error, please contact the sender and delete the
> >>>> >material from any computer.
> >>
> >> --
> >> Gabriele Guidi
>
> --
> Gabriele Guidi
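
For readers who want to reproduce the behaviour discussed in this thread without a Tika dependency, here is a minimal sketch using only the JDK. It is not Tika's code: `sniff` is a hypothetical stand-in for `tika.detect(InputStream)` that only checks for the `%PDF` magic, and it wraps non-markable streams in a fresh BufferedInputStream the way Konstantin describes the facade doing. `fakePdf`, `withoutMark` and the class name are likewise made up for illustration.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class DetectTwiceSketch {

    // Hypothetical test data: a PDF header followed by zero padding.
    public static byte[] fakePdf(int size) {
        byte[] data = new byte[size];
        byte[] magic = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(magic, 0, data, 0, magic.length);
        return data;
    }

    // Hypothetical stream that reports no mark/reset support,
    // like the content stream in the thread.
    public static InputStream withoutMark(byte[] data) {
        return new FilterInputStream(new ByteArrayInputStream(data)) {
            @Override
            public boolean markSupported() {
                return false;
            }
        };
    }

    // Stand-in for tika.detect(InputStream); an assumption, NOT Tika's code.
    // A non-markable stream gets a fresh BufferedInputStream on every call,
    // and that wrapper pulls up to 8192 bytes out of the underlying stream.
    public static String sniff(InputStream in) {
        try {
            InputStream s = in.markSupported() ? in : new BufferedInputStream(in);
            byte[] head = new byte[4];
            s.mark(head.length);
            int n = 0;
            while (n < head.length) {
                int r = s.read(head, n, head.length - n);
                if (r < 0) {
                    break; // end of stream before 4 bytes were read
                }
                n += r;
            }
            s.reset(); // rewinds the wrapper only, never the underlying stream
            String magic = new String(head, 0, n, StandardCharsets.US_ASCII);
            return magic.equals("%PDF") ? "application/pdf" : "application/octet-stream";
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] pdf = fakePdf(10_000);

        // Problem: the same non-markable stream passed twice.
        InputStream raw = withoutMark(pdf);
        System.out.println("raw 1:      " + sniff(raw));
        System.out.println("raw 2:      " + sniff(raw));

        // Workaround from the thread: wrap once, reuse the same wrapper.
        InputStream buffered = new BufferedInputStream(withoutMark(pdf));
        System.out.println("buffered 1: " + sniff(buffered));
        System.out.println("buffered 2: " + sniff(buffered));
    }
}
```

In this sketch the second call on the raw stream falls back to application/octet-stream, because the first call's temporary buffer has already consumed the first 8192 bytes. Both calls on the reused BufferedInputStream see the `%PDF` header, since mark/reset rewinds that wrapper's own buffer. With real Tika the same idea would be to create the BufferedInputStream once and pass that same object to every tika.detect call, or to copy the content to a byte[] first, as suggested above.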
