problem with the inputstream after calling the detect(InputStream in) method
Hi,

I have been using Tika for a while now without any problems, and I am a big fan of the software. I wanted to do my part and report what I suspect might be a bug.

My code uses two libraries, JavaMail and java-libpst, and I am unit testing with Dumbster. The last unit test I built with Dumbster checks that all of the attachments are appended correctly when I send the email, and it failed. After some nitty-gritty debugging, I discovered that if I put a System.out.println(in.read()); directly before the call to Tika, it printed the correct number on the console. If I put the same statement after the call to Tika, it printed -1.

    public void sendAsEmail(PSTMessage email, String parent, String dir)
            throws IOException, MessagingException, PSTException {
        String subject = email.getSubject();
        String to = primaryRecipientsEmail(email);
        String from = email.getSenderEmailAddress();
        if (!isValidEmailAddress(from)) {
            from = "emptyfromstr...@placeholder.com";
        }
        Properties props = new Properties();
        props.put("mail.transport.protocol", "smtp");
        props.put("mail.smtp.host", "localhost");
        props.put("mail.smtp.auth", "false");
        props.put("mail.debug", "false");
        props.put("mail.smtp.port", "3025"); // change back to 25
        Session session = Session.getDefaultInstance(props);
        Transport transport = session.getTransport("smtp");
        transport.connect();
        Message message = new MimeMessage(session);
        message.addHeader("Parent-Info", parent);
        message.addHeader("directory", dir);
        message.setSubject(subject);
        messageBodyPart.setText(email.getBody());
        multipart.addBodyPart(messageBodyPart);
        message.setFrom(new InternetAddress(from));
        message.setRecipients(Message.RecipientType.TO, InternetAddress.parse(to));
        try {
            String transportHeaders = email.getTransportMessageHeaders();
            String[] headers = parseTransporHeaders(transportHeaders);
            for (String header : headers) {
                messageBodyPart.addHeaderLine(header);
                multipart.addBodyPart(messageBodyPart);
            }
        } catch (Exception e) {
            log.info("missing chunk is transport headers: " + e);
        }
        try {
            if (email.hasAttachments()) {
                int attachmentIndex = 0;
                while (attachmentIndex < email.getNumberOfAttachments()) {
                    PSTAttachment attachment = email.getAttachment(attachmentIndex);
                    InputStream in = attachment.getFileInputStream();
                    if (attachment.getAttachMethod() != PSTAttachment.ATTACHMENT_METHOD_EMBEDDED
                            && attachment.getAttachMethod() != PSTAttachment.ATTACHMENT_METHOD_OLE) {
                        String filename = attachment.getFilename();
                        // Here is where I called Tika, for use in a method
                        // that has since been deprecated.
                        String mime = tika.detect(in);
                        messageBodyPart = new MimeBodyPart();
                        messageBodyPart.attachFile(file);
                        messageBodyPart.setFileName(filename);
                        multipart.addBodyPart(messageBodyPart);
                    } else {
                        log.info("not base 64 file: " + attachment.getFilename());
                    }
                    in.close();
                    attachmentIndex++;
                }
            }
        } catch (Exception e) {
            log.info("failed attaching file to " + e);
        }
        message.setContent(multipart);
        transport.sendMessage(message, message.getAllRecipients());
        transport.close();
    }

Following the advice of Ken Krugler, I thought I would share this on this list to see whether it is an error in my code or an issue in Tika.
Re: problem with the inputstream after calling the detect(InputStream in) method
Ok, thanks. That was my problem. Also, I read your book and enjoyed it very much. Is this the forum where I could bring up an issue I found with the Tika-JAX-RS server?

On Mon, Sep 30, 2013 at 10:45 AM, Jukka Zitting jukka.zitt...@gmail.com wrote:

> Hi,
>
> On Mon, Sep 30, 2013 at 10:36 AM, kevin slote kslo...@gmail.com wrote:
>> InputStream in= attachment.getFileInputStream();
>> [...]
>> String mime = tika.detect(in);
>
> See the javadocs [1]:
>
>     "If the document stream supports the mark feature, then the stream
>     is marked and reset to the original position before this method
>     returns"
>
> I believe the stream you're using does not support the mark feature
> (see [2]), which makes it impossible for Tika to restore the original
> state of the stream once type detection is done.
>
> Using BufferedInputStream [3] should fix your problem:
>
>     InputStream in= new BufferedInputStream(attachment.getFileInputStream());
>
> [1] http://tika.apache.org/1.4/api/org/apache/tika/Tika.html#detect(java.io.InputStream)
> [2] http://docs.oracle.com/javase/7/docs/api/java/io/InputStream.html#mark(int)
> [3] http://docs.oracle.com/javase/7/docs/api/java/io/BufferedInputStream.html
>
> BR,
>
> Jukka Zitting
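The mark/reset behaviour described in the quoted javadoc can be reproduced without Tika. The sketch below is only an illustration under stated assumptions: sniff() is a hypothetical stand-in for a detector that follows the quoted contract (it is not Tika's implementation), and withoutMark() is a made-up helper that simulates a stream reporting no mark support, which is what the attachment stream in this thread is believed to do.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class MarkResetDemo {

    /**
     * Hypothetical stand-in for a detector: reads up to {@code limit} bytes
     * and restores the position only when the stream supports mark/reset.
     */
    static void sniff(InputStream in, int limit) throws IOException {
        boolean canReset = in.markSupported();
        if (canReset) {
            in.mark(limit);
        }
        byte[] header = new byte[limit];
        in.read(header, 0, limit); // the bytes read here are gone unless we can reset
        if (canReset) {
            in.reset();
        }
    }

    /** Simulates a raw stream that reports no mark support (assumption for the demo). */
    static InputStream withoutMark(InputStream in) {
        return new FilterInputStream(in) {
            @Override public boolean markSupported() { return false; }
            @Override public void mark(int readlimit) { /* no-op */ }
            @Override public void reset() throws IOException {
                throw new IOException("mark/reset not supported");
            }
        };
    }

    /** Runs the stand-in detector, then returns what the caller would read next. */
    static int firstByteAfterSniff(InputStream in) {
        try {
            sniff(in, 16);
            return in.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = {1, 2, 3, 4};
        // No mark support: detection consumed the data, so this prints -1.
        System.out.println(firstByteAfterSniff(withoutMark(new ByteArrayInputStream(data))));
        // Same stream wrapped in BufferedInputStream: position restored, prints 1.
        System.out.println(firstByteAfterSniff(
                new BufferedInputStream(withoutMark(new ByteArrayInputStream(data)))));
    }
}
```

The only difference between the two calls in main() is the BufferedInputStream wrapper, which is the fix suggested above.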
[jira] [Closed] (TIKA-1162) content-type/charset problem with RFC822Parser
[ https://issues.apache.org/jira/browse/TIKA-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maciej Lizewski closed TIKA-1162.
---------------------------------
Resolution: Cannot Reproduce

Cannot reproduce on a newer Tika version, so it is probably not an issue any more.

> content-type/charset problem with RFC822Parser
> ----------------------------------------------
>
> Key: TIKA-1162
> URL: https://issues.apache.org/jira/browse/TIKA-1162
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Maciej Lizewski
>
> RFC822Parser (MIME mail) uses MailContentHandler, which internally uses AutoDetectParser to handle each MIME part.
> The problem is that MailContentHandler reads the MIME part headers, sets the CONTENT_TYPE and CONTENT_ENCODING metadata properly, and passes this metadata to the AutoDetectParser::parse method. But that method ignores those headers and overwrites them:
>
>     MediaType type = this.getDetector().detect(tis, metadata);
>     metadata.set(Metadata.CONTENT_TYPE, type.toString());
>
> This leads to some additional recursion loops (the Detector returns the message/rfc822 MIME type instead of the proper MIME type for the current MIME part), and finally it somehow skips out of the loop, but without proper content-type and content-encoding headers...
> My proposal is to add a check in AutoDetectParser::parse for whether the metadata already contains CONTENT_TYPE and, in that case, not override it. If this is not valid behavior in general, then RFC822Parser should use a custom parser in MailContentHandler that respects the passed content type...

--
This message was sent by Atlassian JIRA
(v6.1#6144)
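The check proposed in the issue description could look roughly like the sketch below. It only illustrates the reporter's idea and is not Tika code: a plain Map stands in for Tika's Metadata object, a Supplier stands in for the detector call, and the names ContentTypeGuardSketch and resolveType are invented for this example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

final class ContentTypeGuardSketch {

    // Header name used as the metadata key in this sketch (assumption).
    static final String CONTENT_TYPE = "Content-Type";

    /**
     * Keeps a content type that the caller already supplied; only falls back
     * to the detector (and records its answer) when none is present.
     */
    static String resolveType(Map<String, String> metadata, Supplier<String> detector) {
        String declared = metadata.get(CONTENT_TYPE);
        if (declared != null && !declared.isEmpty()) {
            return declared; // respect the type taken from the MIME part headers
        }
        String detected = detector.get();
        metadata.put(CONTENT_TYPE, detected);
        return detected;
    }

    public static void main(String[] args) {
        // Hypothetical detector that always answers message/rfc822.
        Supplier<String> detector = () -> "message/rfc822";

        Map<String, String> withHeader = new HashMap<>();
        withHeader.put(CONTENT_TYPE, "text/plain; charset=iso-8859-2");
        System.out.println(resolveType(withHeader, detector)); // declared type is kept

        Map<String, String> withoutHeader = new HashMap<>();
        System.out.println(resolveType(withoutHeader, detector)); // detector result is used
    }
}
```

Whether skipping detection is acceptable in general is exactly the open question raised in the issue, so this is one possible shape of the guard rather than a recommendation.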
[jira] [Commented] (TIKA-1162) content-type/charset problem with RFC822Parser
[ https://issues.apache.org/jira/browse/TIKA-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13781922#comment-13781922 ]

Tim Allison commented on TIKA-1162:
-----------------------------------

Dear Colleague,

I'm on paternity leave. Will be back part time on October 14.

Best,

Tim

> content-type/charset problem with RFC822Parser
> ----------------------------------------------
>
> Key: TIKA-1162
> URL: https://issues.apache.org/jira/browse/TIKA-1162
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Maciej Lizewski
Re: problem with the inputstream after calling the detect(InputStream in) method
Hi

On 30/09/13 15:49, kevin slote wrote:
> Ok, thanks. That was my problem. Also, I read your book and enjoyed it
> very much. Is this the forum where I could bring up an issue I found
> with the Tika-JAX-RS server?

What kind of issue are you seeing?

Sergey

> On Mon, Sep 30, 2013 at 10:45 AM, Jukka Zitting jukka.zitt...@gmail.com wrote:
>
>> Hi,
>>
>> On Mon, Sep 30, 2013 at 10:36 AM, kevin slote kslo...@gmail.com wrote:
>>> InputStream in= attachment.getFileInputStream();
>>> [...]
>>> String mime = tika.detect(in);
>>
>> See the javadocs [1]:
>>
>>     "If the document stream supports the mark feature, then the stream
>>     is marked and reset to the original position before this method
>>     returns"
>>
>> I believe the stream you're using does not support the mark feature
>> (see [2]), which makes it impossible for Tika to restore the original
>> state of the stream once type detection is done.
>>
>> Using BufferedInputStream [3] should fix your problem:
>>
>>     InputStream in= new BufferedInputStream(attachment.getFileInputStream());
>>
>> [1] http://tika.apache.org/1.4/api/org/apache/tika/Tika.html#detect(java.io.InputStream)
>> [2] http://docs.oracle.com/javase/7/docs/api/java/io/InputStream.html#mark(int)
>> [3] http://docs.oracle.com/javase/7/docs/api/java/io/BufferedInputStream.html
>>
>> BR,
>>
>> Jukka Zitting
Re: problem with the inputstream after calling the detect(InputStream in) method
Well, there was no error at runtime; it was just that the data was erased. After debugging it with a print statement, System.out.println(in.read());, I discovered that the InputStream was being erased after I called the detect(InputStream in) method.

On Mon, Sep 30, 2013 at 11:19 AM, Sergey Beryozkin sberyoz...@gmail.com wrote:

> Hi
>
> On 30/09/13 15:49, kevin slote wrote:
>> Ok, thanks. That was my problem. Also, I read your book and enjoyed it
>> very much. Is this the forum where I could bring up an issue I found
>> with the Tika-JAX-RS server?
>
> What kind of issue are you seeing?
>
> Sergey
>
>> On Mon, Sep 30, 2013 at 10:45 AM, Jukka Zitting jukka.zitt...@gmail.com wrote:
>>
>>> Hi,
>>>
>>> On Mon, Sep 30, 2013 at 10:36 AM, kevin slote kslo...@gmail.com wrote:
>>>> InputStream in= attachment.getFileInputStream();
>>>> [...]
>>>> String mime = tika.detect(in);
>>>
>>> See the javadocs [1]:
>>>
>>>     "If the document stream supports the mark feature, then the stream
>>>     is marked and reset to the original position before this method
>>>     returns"
>>>
>>> I believe the stream you're using does not support the mark feature
>>> (see [2]), which makes it impossible for Tika to restore the original
>>> state of the stream once type detection is done.
>>>
>>> Using BufferedInputStream [3] should fix your problem:
>>>
>>>     InputStream in= new BufferedInputStream(attachment.getFileInputStream());
>>>
>>> [1] http://tika.apache.org/1.4/api/org/apache/tika/Tika.html#detect(java.io.InputStream)
>>> [2] http://docs.oracle.com/javase/7/docs/api/java/io/InputStream.html#mark(int)
>>> [3] http://docs.oracle.com/javase/7/docs/api/java/io/BufferedInputStream.html
>>>
>>> BR,
>>>
>>> Jukka Zitting
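A side note on the debugging statement mentioned in this thread: in.read() itself consumes one byte, so the print statement changes the stream it is inspecting. One way to look at the next byte without losing it is java.io.PushbackInputStream. The sketch below is a minimal illustration; the class name PeekDemo and the helper peek() are made up for this example and are not part of Tika or java-libpst.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;

final class PeekDemo {

    /**
     * Returns the next byte (or -1 at end of stream) and pushes it back,
     * so that later reads still see it.
     */
    static int peek(PushbackInputStream in) {
        try {
            int b = in.read();
            if (b != -1) {
                in.unread(b); // default pushback buffer holds exactly one byte
            }
            return b;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        PushbackInputStream in =
                new PushbackInputStream(new ByteArrayInputStream(new byte[] {42, 7}));
        System.out.println(peek(in));   // prints 42
        System.out.println(peek(in));   // prints 42 again: nothing was consumed
        System.out.println(in.read());  // prints 42: the real read still sees it
    }
}
```

A peek that returns -1 right after detection would point to the same cause discussed above: the stream was read to its end and could not be reset.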