problem with the inputstream after calling the detect(InputStream in) method

2013-09-30 Thread kevin slote
Hi,  I have been using Tika for a while now without any problems, and I am a
big fan of the software.  I wanted to do my part and report what I suspect
might be a bug.

My code uses two libraries, JavaMail and java-libpst, and I am unit testing
with Dumbster.  The last unit test that I built with Dumbster sends an email
and checks that all of the attachments were appended correctly; this test
failed.  After some nitty-gritty debugging, I discovered that if I placed a
System.out.println(in.read());   directly before the call to Tika, it
printed the expected byte value on the console.  However, if I placed the
same statement after the call to Tika, it printed -1.

public void sendAsEmail(PSTMessage email, String parent, String dir)
        throws IOException, MessagingException, PSTException {
    String subject = email.getSubject();
    String to = primaryRecipientsEmail(email);
    String from = email.getSenderEmailAddress();
    if (!isValidEmailAddress(from)) {
        from = "emptyfromstr...@placeholder.com";
    }
    Properties props = new Properties();
    props.put("mail.transport.protocol", "smtp");
    props.put("mail.smtp.host", "localhost");
    props.put("mail.smtp.auth", "false");
    props.put("mail.debug", "false");
    props.put("mail.smtp.port", "3025"); // change back to 25

    Session session = Session.getDefaultInstance(props);

    Transport transport = session.getTransport("smtp");
    transport.connect();

    Message message = new MimeMessage(session);
    message.addHeader("Parent-Info", parent);
    message.addHeader("directory", dir);
    message.setSubject(subject);
    messageBodyPart.setText(email.getBody());
    multipart.addBodyPart(messageBodyPart);
    message.setFrom(new InternetAddress(from));
    message.setRecipients(Message.RecipientType.TO,
            InternetAddress.parse(to));

    try {
        String transportHeaders = email.getTransportMessageHeaders();
        String[] headers = parseTransporHeaders(transportHeaders);
        for (String header : headers) {
            messageBodyPart.addHeaderLine(header);
            multipart.addBodyPart(messageBodyPart);
        }
    } catch (Exception e) {
        log.info("missing chunk is transport headers: " + e);
    }
    try {
        if (email.hasAttachments()) {
            int attachmentIndex = 0;
            while (attachmentIndex < email.getNumberOfAttachments()) {
                PSTAttachment attachment = email.getAttachment(attachmentIndex);
                InputStream in = attachment.getFileInputStream();
                if (attachment.getAttachMethod() != PSTAttachment.ATTACHMENT_METHOD_EMBEDDED
                        && attachment.getAttachMethod() != PSTAttachment.ATTACHMENT_METHOD_OLE) {

                    String filename = attachment.getFilename();
                    // Here is where I called Tika, for use in a method that
                    // has since been deprecated.
                    String mime = tika.detect(in);
                    messageBodyPart = new MimeBodyPart();
                    messageBodyPart.attachFile(file);
                    messageBodyPart.setFileName(filename);
                    multipart.addBodyPart(messageBodyPart);

                } else {
                    log.info("not base 64 file: " + attachment.getFilename());
                }
                in.close();
                attachmentIndex++;
            }
        }
    } catch (Exception e) {
        log.info("failed attaching file to " + e);
    }
    message.setContent(multipart);
    transport.sendMessage(message, message.getAllRecipients());
    transport.close();
}

Following the advice of Ken Krugler, I thought I would share this on this
list to see whether it is an error in my code or an issue in Tika.


Re: problem with the inputstream after calling the detect(InputStream in) method

2013-09-30 Thread kevin slote
Ok, thanks.  That was my problem.  Also, I read your book and enjoyed it
very much.  Is this the forum where I could bring up an issue I found with
the Tika-JAX-RS server?


On Mon, Sep 30, 2013 at 10:45 AM, Jukka Zitting jukka.zitt...@gmail.com wrote:

> Hi,
>
> On Mon, Sep 30, 2013 at 10:36 AM, kevin slote kslo...@gmail.com wrote:
>> InputStream in= attachment.getFileInputStream();
>> [...]
>> String mime = tika.detect(in);
>
> See the javadocs [1]: "If the document stream supports the mark
> feature, then the stream is marked and reset to the original position
> before this method returns"
>
> I believe the stream you're using does not support the mark feature
> (see [2]), which makes it impossible for Tika to restore the original
> state of the stream once type detection is done.
>
> Using BufferedInputStream [3] should fix your problem:
>
>     InputStream in = new BufferedInputStream(attachment.getFileInputStream());
>
> [1] http://tika.apache.org/1.4/api/org/apache/tika/Tika.html#detect(java.io.InputStream)
> [2] http://docs.oracle.com/javase/7/docs/api/java/io/InputStream.html#mark(int)
> [3] http://docs.oracle.com/javase/7/docs/api/java/io/BufferedInputStream.html
>
> BR,
>
> Jukka Zitting
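
The mark/reset behavior described in the reply above can be reproduced
without Tika. The following is only a minimal sketch: sniff() is a
hypothetical stand-in for a detector that follows the documented contract
(mark, read a prefix, reset if supported), and NoMarkStream is a made-up
wrapper that simulates a stream without mark support. Neither name comes
from Tika or java-libpst.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class MarkResetDemo {

    /** Hypothetical wrapper that hides mark/reset support of the wrapped stream. */
    static final class NoMarkStream extends FilterInputStream {
        NoMarkStream(InputStream in) { super(in); }
        @Override public boolean markSupported() { return false; }
        @Override public void mark(int readLimit) { /* not supported, ignored */ }
        @Override public void reset() throws IOException {
            throw new IOException("mark/reset not supported");
        }
    }

    /**
     * Stand-in for a detector (an assumption, not Tika's code): reads up to
     * 'limit' bytes and restores the position only if mark/reset is supported.
     */
    static void sniff(InputStream in, int limit) throws IOException {
        boolean canReset = in.markSupported();
        if (canReset) {
            in.mark(limit);
        }
        byte[] buf = new byte[limit];
        int total = 0;
        while (total < limit) {
            int n = in.read(buf, total, limit - total);
            if (n < 0) {
                break; // end of stream reached before the limit
            }
            total += n;
        }
        if (canReset) {
            in.reset();
        }
    }

    /** Returns what in.read() yields right after sniffing, or -1 at end of stream. */
    static int firstByteAfterSniff(InputStream in) {
        try {
            sniff(in, 16);
            return in.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static InputStream unbuffered(byte[] data) {
        return new NoMarkStream(new ByteArrayInputStream(data));
    }

    static InputStream buffered(byte[] data) {
        return new BufferedInputStream(unbuffered(data));
    }

    public static void main(String[] args) {
        byte[] data = {10, 20, 30};
        System.out.println(firstByteAfterSniff(unbuffered(data))); // prints -1
        System.out.println(firstByteAfterSniff(buffered(data)));   // prints 10
    }
}
```

With the unbuffered stream the sniffed bytes are gone, which matches the -1
reported in the original message; with the BufferedInputStream wrapper the
position is restored and the first byte is still readable.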



[jira] [Closed] (TIKA-1162) content-type/charset problem with RFC822Parser

2013-09-30 Thread Maciej Lizewski (JIRA)

     [ https://issues.apache.org/jira/browse/TIKA-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maciej Lizewski closed TIKA-1162.
---------------------------------

Resolution: Cannot Reproduce

Cannot reproduce on a newer Tika version, so it is probably no longer an issue.

 content-type/charset problem with RFC822Parser
 ----------------------------------------------

 Key: TIKA-1162
 URL: https://issues.apache.org/jira/browse/TIKA-1162
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Maciej Lizewski

 RFC822Parser (MIME mail) uses MailContentHandler, which internally uses
 AutoDetectParser to handle each MIME part. The problem is that
 MailContentHandler reads the MIME part headers, sets the CONTENT_TYPE and
 CONTENT_ENCODING metadata properly, and passes this metadata to the
 AutoDetectParser::parse method. But that method ignores those headers and
 overwrites them:
 MediaType type = this.getDetector().detect(tis, metadata);
 metadata.set(Metadata.CONTENT_TYPE, type.toString());
 This leads to some additional recursion loops (the Detector returns the
 message/rfc822 MIME type instead of the proper MIME type for the current
 MIME part), and finally it somehow exits the loop, but without the proper
 content-type and content-encoding headers.
 My proposal is to add a check in AutoDetectParser::parse for whether the
 metadata already contains CONTENT_TYPE and, in that case, not override it.
 If this is not valid behavior in general, then RFC822Parser should use a
 custom parser in MailContentHandler that respects the passed content-type.
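
The check proposed in the description above can be sketched in isolation.
This is only an illustration under stated assumptions: a plain Map stands in
for Tika's Metadata object, a Supplier stands in for the detector call, and
resolveContentType is a hypothetical name. It is not Tika's actual
AutoDetectParser code, and the issue itself was closed as Cannot Reproduce.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

class PreserveContentTypeSketch {

    // Key used in this sketch; the Map is a stand-in for a metadata object.
    static final String CONTENT_TYPE = "Content-Type";

    /**
     * Returns the content type already present in the metadata, if any.
     * Only when none is present is the (stand-in) detector consulted and
     * its result stored.
     */
    static String resolveContentType(Map<String, String> metadata,
                                     Supplier<String> detector) {
        String declared = metadata.get(CONTENT_TYPE);
        if (declared != null && !declared.isEmpty()) {
            return declared; // respect the value parsed from the MIME part headers
        }
        String detected = detector.get();
        metadata.put(CONTENT_TYPE, detected);
        return detected;
    }

    public static void main(String[] args) {
        Map<String, String> withHeader = new HashMap<>();
        withHeader.put(CONTENT_TYPE, "text/plain; charset=iso-8859-2");
        // The declared type wins over whatever the detector would report.
        System.out.println(resolveContentType(withHeader, () -> "message/rfc822"));

        Map<String, String> withoutHeader = new HashMap<>();
        // No declared type, so the detector result is used and stored.
        System.out.println(resolveContentType(withoutHeader, () -> "application/pdf"));
    }
}
```

The trade-off in such a check is that a wrong declared header would no longer
be corrected by detection, which may be why the description also offers a
custom parser in MailContentHandler as an alternative.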



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (TIKA-1162) content-type/charset problem with RFC822Parser

2013-09-30 Thread Tim Allison (JIRA)

    [ https://issues.apache.org/jira/browse/TIKA-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13781922#comment-13781922 ]

Tim Allison commented on TIKA-1162:
-----------------------------------

Dear Colleague,
  I'm on paternity leave.  Will be back part time on October 14.

   Best,

Tim





Re: problem with the inputstream after calling the detect(InputStream in) method

2013-09-30 Thread Sergey Beryozkin

Hi

On 30/09/13 15:49, kevin slote wrote:
> Ok, thanks.  That was my problem.  Also, I read your book and enjoyed it
> very much.  Is this the forum where I could bring up an issue I found with
> the Tika-JAX-RS server?

What kind of issue are you seeing?

Sergey



Re: problem with the inputstream after calling the detect(InputStream in) method

2013-09-30 Thread kevin slote
Well, there was no error at runtime; the data was simply gone.  After
debugging with a print statement, System.out.println(in.read());,  I
discovered that the InputStream had been read to the end (in.read()
returned -1) after I called the detect(InputStream in) method.


On Mon, Sep 30, 2013 at 11:19 AM, Sergey Beryozkin sberyoz...@gmail.com wrote:

> Hi
>
> On 30/09/13 15:49, kevin slote wrote:
>> Ok, thanks.  That was my problem.  Also, I read your book and enjoyed it
>> very much.  Is this the forum where I could bring up an issue I found with
>> the Tika-JAX-RS server?
>
> What kind of issue are you seeing?
>
> Sergey

