[jira] [Assigned] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)

2015-03-28 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison reassigned TIKA-1584:
-

Assignee: Tim Allison

 Tika 1.7 possible regression (nested attachment files not getting parsed)
 -

 Key: TIKA-1584
 URL: https://issues.apache.org/jira/browse/TIKA-1584
 Project: Tika
  Issue Type: Bug
  Components: server
Affects Versions: 1.7
Reporter: Rob Tulloh
Assignee: Tim Allison
Priority: Blocker

 I tried to send this to the tika user list, but got a qmail failure so I am 
 opening a jira to see if I can get help with this.
 There appears to be a change in the behavior of tika since 1.5 (the last 
 version we have used). In 1.5, if we pass a file with content type of rfc822 
 which contains a zip that contains a docx file, the entire content would get 
 recursed and the text returned. In 1.7, tika only unwinds as far as the zip 
 file and ignores the content of the contained docx file. This is causing a 
 regression failure in our search tests because the contents of the docx file 
 are not found when searched for.
  
 We are testing with tika-server if this helps. If we ask the meta service to 
 just characterize the test data, it correctly determines the input is of type 
 rfc822. However, on extract, the contents of the attachment are not extracted 
 as expected.
 curl -X PUT -T test.eml -q -H "Content-Type:application/octet-stream" http://localhost:9998/meta 2>/dev/null | grep Content-Type
 Content-Type,message/rfc822
 curl -X PUT -T test.eml -q -H "Content-Type:application/octet-stream" http://localhost:9998/tika 2>/dev/null | grep docx
 sign.docx   <--- this is not expected, need contents of this extracted
 We can easily reproduce this problem with a simple eml file with an 
 attachment. Can someone please comment if this seems like a problem or 
 perhaps we need to change something in our call to get the old behavior?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Review Request 32291: ISATab parsers (preliminary version)

2015-03-28 Thread Chris Mattmann

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/32291/#review78159
---

Ship it!


I have run this in production and it works awesome! Thanks Giuseppe!

- Chris Mattmann


On March 23, 2015, 5:04 p.m., Giuseppe Totaro wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/32291/
 ---
 
 (Updated March 23, 2015, 5:04 p.m.)
 
 
 Review request for tika and Chris Mattmann.
 
 
 Bugs: TIKA-1580
 https://issues.apache.org/jira/browse/TIKA-1580
 
 
 Repository: tika
 
 
 Description
 ---
 
 ISATab parsers. This preliminary solution provides three parsers, one for 
 each ISA-Tab filetype (Investigation, Study, Assay).
 
 
 Diffs
 -
 
   trunk/tika-bundle/pom.xml 1668683 
   trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml 
 1668683 
   trunk/tika-parsers/pom.xml 1668683 
   
 trunk/tika-parsers/src/main/java/org/apache/tika/parser/isatab/ISATabAssayParser.java
  PRE-CREATION 
   
 trunk/tika-parsers/src/main/java/org/apache/tika/parser/isatab/ISATabInvestigationParser.java
  PRE-CREATION 
   
 trunk/tika-parsers/src/main/java/org/apache/tika/parser/isatab/ISATabStudyParser.java
  PRE-CREATION 
   
 trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
  1668683 
   
 trunk/tika-parsers/src/test/java/org/apache/tika/parser/isatab/ISATabAssayParserTest.java
  PRE-CREATION 
   
 trunk/tika-parsers/src/test/java/org/apache/tika/parser/isatab/ISATabInvestigationParserTest.java
  PRE-CREATION 
   
 trunk/tika-parsers/src/test/java/org/apache/tika/parser/isatab/ISATabStudyParserTest.java
  PRE-CREATION 
   
 trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_bii-s-2_metabolite
  profiling_NMR spectroscopy.txt PRE-CREATION 
   
 trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_metabolome.txt
  PRE-CREATION 
   
 trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_microarray.txt
  PRE-CREATION 
   
 trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_proteome.txt
  PRE-CREATION 
   
 trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_transcriptome.txt
  PRE-CREATION 
   
 trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/i_investigation.txt
  PRE-CREATION 
   
 trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/s_BII-S-1.txt
  PRE-CREATION 
   
 trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/s_BII-S-2.txt
  PRE-CREATION 
 
 Diff: https://reviews.apache.org/r/32291/diff/
 
 
 Testing
 ---
 
 Tested on sample ISA-Tab files downloaded from 
 http://www.isa-tools.org/format/examples/.
 
 
 Thanks,
 
 Giuseppe Totaro
 




Re: Review Request 32291: ISATab parsers (preliminary version)

2015-03-28 Thread Chris Mattmann

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/32291/#review78158
---

Ship it!


Ship It!

- Chris Mattmann


On March 23, 2015, 5:04 p.m., Giuseppe Totaro wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/32291/
 ---
 
 (Updated March 23, 2015, 5:04 p.m.)
 
 
 Review request for tika and Chris Mattmann.
 
 
 Bugs: TIKA-1580
 https://issues.apache.org/jira/browse/TIKA-1580
 
 
 Repository: tika
 
 
 Description
 ---
 
 ISATab parsers. This preliminary solution provides three parsers, one for 
 each ISA-Tab filetype (Investigation, Study, Assay).
 
 
 Diffs
 -
 
   trunk/tika-bundle/pom.xml 1668683 
   trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml 
 1668683 
   trunk/tika-parsers/pom.xml 1668683 
   
 trunk/tika-parsers/src/main/java/org/apache/tika/parser/isatab/ISATabAssayParser.java
  PRE-CREATION 
   
 trunk/tika-parsers/src/main/java/org/apache/tika/parser/isatab/ISATabInvestigationParser.java
  PRE-CREATION 
   
 trunk/tika-parsers/src/main/java/org/apache/tika/parser/isatab/ISATabStudyParser.java
  PRE-CREATION 
   
 trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
  1668683 
   
 trunk/tika-parsers/src/test/java/org/apache/tika/parser/isatab/ISATabAssayParserTest.java
  PRE-CREATION 
   
 trunk/tika-parsers/src/test/java/org/apache/tika/parser/isatab/ISATabInvestigationParserTest.java
  PRE-CREATION 
   
 trunk/tika-parsers/src/test/java/org/apache/tika/parser/isatab/ISATabStudyParserTest.java
  PRE-CREATION 
   
 trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_bii-s-2_metabolite
  profiling_NMR spectroscopy.txt PRE-CREATION 
   
 trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_metabolome.txt
  PRE-CREATION 
   
 trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_microarray.txt
  PRE-CREATION 
   
 trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_proteome.txt
  PRE-CREATION 
   
 trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_transcriptome.txt
  PRE-CREATION 
   
 trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/i_investigation.txt
  PRE-CREATION 
   
 trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/s_BII-S-1.txt
  PRE-CREATION 
   
 trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/s_BII-S-2.txt
  PRE-CREATION 
 
 Diff: https://reviews.apache.org/r/32291/diff/
 
 
 Testing
 ---
 
 Tested on sample ISA-Tab files downloaded from 
 http://www.isa-tools.org/format/examples/.
 
 
 Thanks,
 
 Giuseppe Totaro
 




[jira] [Updated] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)

2015-03-28 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1584:
--
Priority: Blocker  (was: Major)

 Tika 1.7 possible regression (nested attachment files not getting parsed)
 -

 Key: TIKA-1584
 URL: https://issues.apache.org/jira/browse/TIKA-1584
 Project: Tika
  Issue Type: Bug
  Components: server
Affects Versions: 1.7
Reporter: Rob Tulloh
Priority: Blocker

 I tried to send this to the tika user list, but got a qmail failure so I am 
 opening a jira to see if I can get help with this.
 There appears to be a change in the behavior of tika since 1.5 (the last 
 version we have used). In 1.5, if we pass a file with content type of rfc822 
 which contains a zip that contains a docx file, the entire content would get 
 recursed and the text returned. In 1.7, tika only unwinds as far as the zip 
 file and ignores the content of the contained docx file. This is causing a 
 regression failure in our search tests because the contents of the docx file 
 are not found when searched for.
  
 We are testing with tika-server if this helps. If we ask the meta service to 
 just characterize the test data, it correctly determines the input is of type 
 rfc822. However, on extract, the contents of the attachment are not extracted 
 as expected.
 curl -X PUT -T test.eml -q -H "Content-Type:application/octet-stream" http://localhost:9998/meta 2>/dev/null | grep Content-Type
 Content-Type,message/rfc822
 curl -X PUT -T test.eml -q -H "Content-Type:application/octet-stream" http://localhost:9998/tika 2>/dev/null | grep docx
 sign.docx   <--- this is not expected, need contents of this extracted
 We can easily reproduce this problem with a simple eml file with an 
 attachment. Can someone please comment if this seems like a problem or 
 perhaps we need to change something in our call to get the old behavior?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available

2015-03-28 Thread Konstantin Gribov (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385561#comment-14385561
 ] 

Konstantin Gribov commented on TIKA-1575:
-

What about updating to released pdfbox 1.8.9? 

Extracting from {{966679.pdf}} (PDFBOX-2261) hangs on trunk Tika w/ 1.8.8. And 
I see no difference when extracting {{10-814_Appendix B_v3.pdf}}, so the form 
extraction issue seems to be fixed.

My question is related to the discussion about releasing Tika 1.8/1.7.1; see dev@ 
(https://mail-archives.apache.org/mod_mbox/tika-dev/201503.mbox/%3CCAM%3DrFA6vvFV3XqpvSNCrubrVHhVO%3Dq%2BighPMRUkmA9f-fKkSXA%40mail.gmail.com%3E)

 Upgrade to PDFBox 1.8.9 when available
 --

 Key: TIKA-1575
 URL: https://issues.apache.org/jira/browse/TIKA-1575
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Priority: Minor
 Attachments: 005937.pdf.json, 005937_1_8_9-SNAPSHOT.pdf.json, 
 10-814_Appendix B_v3.pdf, 524276_719128_diffs.zip, 
 PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT.xlsx, 
 PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT_reports.zip, 
 PDFBox_1_8_8Vs1_8_9_20150316.zip, caught_ex_1_8_9.zip, 
 content_diffs_20150316.xlsx, diffs_1_8_9_multithread_vs_single_thread.xlsx, 
 reports_1_8_9_multithread_vs_single.zip


 The PDFBox community is about to release 1.8.9.  Let's use this issue to 
 track discussions before the release and to track Tika's upgrade to PDFBox 
 1.8.9



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)

2015-03-28 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385477#comment-14385477
 ] 

Tim Allison commented on TIKA-1584:
---

IMHO this is major enough for a fix asap. Whether that's 1.7.1 with just this 
fix or a full cut of trunk as 1.8 is up to all devs. Tika colleagues, what do 
you think?

 Tika 1.7 possible regression (nested attachment files not getting parsed)
 -

 Key: TIKA-1584
 URL: https://issues.apache.org/jira/browse/TIKA-1584
 Project: Tika
  Issue Type: Bug
  Components: server
Affects Versions: 1.7
Reporter: Rob Tulloh
Assignee: Tim Allison
Priority: Blocker

 I tried to send this to the tika user list, but got a qmail failure so I am 
 opening a jira to see if I can get help with this.
 There appears to be a change in the behavior of tika since 1.5 (the last 
 version we have used). In 1.5, if we pass a file with content type of rfc822 
 which contains a zip that contains a docx file, the entire content would get 
 recursed and the text returned. In 1.7, tika only unwinds as far as the zip 
 file and ignores the content of the contained docx file. This is causing a 
 regression failure in our search tests because the contents of the docx file 
 are not found when searched for.
  
 We are testing with tika-server if this helps. If we ask the meta service to 
 just characterize the test data, it correctly determines the input is of type 
 rfc822. However, on extract, the contents of the attachment are not extracted 
 as expected.
 curl -X PUT -T test.eml -q -H "Content-Type:application/octet-stream" http://localhost:9998/meta 2>/dev/null | grep Content-Type
 Content-Type,message/rfc822
 curl -X PUT -T test.eml -q -H "Content-Type:application/octet-stream" http://localhost:9998/tika 2>/dev/null | grep docx
 sign.docx   <--- this is not expected, need contents of this extracted
 We can easily reproduce this problem with a simple eml file with an 
 attachment. Can someone please comment if this seems like a problem or 
 perhaps we need to change something in our call to get the old behavior?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [DISCUSS] Tika 1.8 or 1.7.1

2015-03-28 Thread Konstantin Gribov
Also, I think we should resolve TIKA-1575 (upgrade to pdfbox 1.8.9) since
pdfbox 1.8.8 hangs on some pdf forms.

-- 
Best regards,
Konstantin Gribov

Sat, 28 Mar 2015 at 23:22, Konstantin Gribov gros...@gmail.com:

 +1 to releasing 1.8.

 --
 Best regards,
 Konstantin Gribov

 Sat, 28 Mar 2015, 22:25, Tyler Palsulich tpalsul...@apache.org:

 I'm also leaning toward 1.8. Especially given the newly identified
 regression in TIKA-1584.

 Tyler
 On Mar 28, 2015 11:47 AM, Mattmann, Chris A (3980) 
 chris.a.mattm...@jpl.nasa.gov wrote:

  Hi Tyler - I would VOTE for 1.8. Given the stuff associated
  with releasing (updating the website; sending emails; waiting
  periods, etc.) let’s ship all the updates we have too along
  with the jhighlight fix.
 
  Cheers,
  Chris
 
  ++
  Chris Mattmann, Ph.D.
  Chief Architect
  Instrument Software and Science Data Systems Section (398)
  NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
  Office: 168-519, Mailstop: 168-527
  Email: chris.a.mattm...@nasa.gov
  WWW:  http://sunset.usc.edu/~mattmann/
  ++
  Adjunct Associate Professor, Computer Science Department
  University of Southern California, Los Angeles, CA 90089 USA
  ++
 
 
 
 
 
 
  -Original Message-
  From: Tyler Palsulich tpalsul...@apache.org
  Reply-To: dev@tika.apache.org dev@tika.apache.org
  Date: Saturday, March 28, 2015 at 8:01 AM
  To: dev@tika.apache.org dev@tika.apache.org
  Subject: [DISCUSS] Tika 1.8 or 1.7.1
 
  Hi Folks,
  
  Now that TIKA-1581 (JHighlight licensing issues) is resolved, we need
 to
  release a new version of Tika. I'll volunteer to be the release manager
  again.
  
  Should we release this as 1.8 or 1.7.1?
  
  Does anyone have any last minute issues they'd like to finish and see
 in
  Tika 1.X? I'd like to get the example working with CORS (TIKA-1585 and
  TIKA-1586). Any others?
  
  Have a good weekend,
  Tyler
 
 




Re: [DISCUSS] Tika 1.8 or 1.7.1

2015-03-28 Thread Allison, Timothy B.
Once we fix TIKA-1584, I don't have a preference.  I defer to Chris's 
experience (so I guess, +1 for 1.8) given the amount of work required.

It'd be great if we could make sure we aren't bundling any pdfs in our tika-app 
jar, too.  Many apologies if that's been fixed!


From: Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov
Sent: Saturday, March 28, 2015 11:41 AM
To: dev@tika.apache.org
Subject: Re: [DISCUSS] Tika 1.8 or 1.7.1

Hi Tyler - I would VOTE for 1.8. Given the stuff associated
with releasing (updating the website; sending emails; waiting
periods, etc.) let’s ship all the updates we have too along
with the jhighlight fix.

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: Tyler Palsulich tpalsul...@apache.org
Reply-To: dev@tika.apache.org dev@tika.apache.org
Date: Saturday, March 28, 2015 at 8:01 AM
To: dev@tika.apache.org dev@tika.apache.org
Subject: [DISCUSS] Tika 1.8 or 1.7.1

Hi Folks,

Now that TIKA-1581 (JHighlight licensing issues) is resolved, we need to
release a new version of Tika. I'll volunteer to be the release manager
again.

Should we release this as 1.8 or 1.7.1?

Does anyone have any last minute issues they'd like to finish and see in
Tika 1.X? I'd like to get the example working with CORS (TIKA-1585 and
TIKA-1586). Any others?

Have a good weekend,
Tyler



[jira] [Commented] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)

2015-03-28 Thread Rob Tulloh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385494#comment-14385494
 ] 

Rob Tulloh commented on TIKA-1584:
--

I would vote for a release, as we have been waiting for TIKA-1371 and were hoping 
to upgrade to 1.7.

 Tika 1.7 possible regression (nested attachment files not getting parsed)
 -

 Key: TIKA-1584
 URL: https://issues.apache.org/jira/browse/TIKA-1584
 Project: Tika
  Issue Type: Bug
  Components: server
Affects Versions: 1.7
Reporter: Rob Tulloh
Assignee: Tim Allison
Priority: Blocker

 I tried to send this to the tika user list, but got a qmail failure so I am 
 opening a jira to see if I can get help with this.
 There appears to be a change in the behavior of tika since 1.5 (the last 
 version we have used). In 1.5, if we pass a file with content type of rfc822 
 which contains a zip that contains a docx file, the entire content would get 
 recursed and the text returned. In 1.7, tika only unwinds as far as the zip 
 file and ignores the content of the contained docx file. This is causing a 
 regression failure in our search tests because the contents of the docx file 
 are not found when searched for.
  
 We are testing with tika-server if this helps. If we ask the meta service to 
 just characterize the test data, it correctly determines the input is of type 
 rfc822. However, on extract, the contents of the attachment are not extracted 
 as expected.
 curl -X PUT -T test.eml -q -H "Content-Type:application/octet-stream" http://localhost:9998/meta 2>/dev/null | grep Content-Type
 Content-Type,message/rfc822
 curl -X PUT -T test.eml -q -H "Content-Type:application/octet-stream" http://localhost:9998/tika 2>/dev/null | grep docx
 sign.docx   <--- this is not expected, need contents of this extracted
 We can easily reproduce this problem with a simple eml file with an 
 attachment. Can someone please comment if this seems like a problem or 
 perhaps we need to change something in our call to get the old behavior?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [DISCUSS] Tika 1.8 or 1.7.1

2015-03-28 Thread Tyler Palsulich
I'm also leaning toward 1.8. Especially given the newly identified
regression in TIKA-1584.

Tyler
On Mar 28, 2015 11:47 AM, Mattmann, Chris A (3980) 
chris.a.mattm...@jpl.nasa.gov wrote:

 Hi Tyler - I would VOTE for 1.8. Given the stuff associated
 with releasing (updating the website; sending emails; waiting
 periods, etc.) let’s ship all the updates we have too along
 with the jhighlight fix.

 Cheers,
 Chris

 ++
 Chris Mattmann, Ph.D.
 Chief Architect
 Instrument Software and Science Data Systems Section (398)
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 168-519, Mailstop: 168-527
 Email: chris.a.mattm...@nasa.gov
 WWW:  http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Associate Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++






 -Original Message-
 From: Tyler Palsulich tpalsul...@apache.org
 Reply-To: dev@tika.apache.org dev@tika.apache.org
 Date: Saturday, March 28, 2015 at 8:01 AM
 To: dev@tika.apache.org dev@tika.apache.org
 Subject: [DISCUSS] Tika 1.8 or 1.7.1

 Hi Folks,
 
 Now that TIKA-1581 (JHighlight licensing issues) is resolved, we need to
 release a new version of Tika. I'll volunteer to be the release manager
 again.
 
 Should we release this as 1.8 or 1.7.1?
 
 Does anyone have any last minute issues they'd like to finish and see in
 Tika 1.X? I'd like to get the example working with CORS (TIKA-1585 and
 TIKA-1586). Any others?
 
 Have a good weekend,
 Tyler




[jira] [Commented] (TIKA-1580) ISA-Tab parsers

2015-03-28 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385541#comment-14385541
 ] 

Chris A. Mattmann commented on TIKA-1580:
-

Committed in r1669839.

Thank you [~gostep], you did an amazing job on this!

{noformat}
[chipotle:~/tmp/tika] mattmann% svn commit -m "Fix for TIKA-1580: Support 
IsaTab MIME identification and parsing. Thanks to Giuseppe Totaro for all the 
great work!"
Sending        CHANGES.txt
Sending        tika-bundle/pom.xml
Sending        tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
Sending        tika-parsers/pom.xml
Adding tika-parsers/src/main/java/org/apache/tika/parser/isatab
Adding 
tika-parsers/src/main/java/org/apache/tika/parser/isatab/ISATabUtils.java
Adding 
tika-parsers/src/main/java/org/apache/tika/parser/isatab/ISArchiveParser.java
Sending        tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
Adding tika-parsers/src/test/java/org/apache/tika/parser/isatab
Adding 
tika-parsers/src/test/java/org/apache/tika/parser/isatab/ISArchiveParserTest.java
Adding tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1
Adding 
tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_bii-s-2_metabolite
 profiling_NMR spectroscopy.txt
Adding 
tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_metabolome.txt
Adding 
tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_microarray.txt
Adding 
tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_proteome.txt
Adding 
tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_transcriptome.txt
Adding 
tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/i_investigation.txt
Adding 
tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/s_BII-S-1.txt
Adding 
tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/s_BII-S-2.txt
Transmitting file data 
Committed revision 1669839.
[chipotle:~/tmp/tika] mattmann% 
{noformat}

 ISA-Tab parsers
 ---

 Key: TIKA-1580
 URL: https://issues.apache.org/jira/browse/TIKA-1580
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Giuseppe Totaro
Assignee: Chris A. Mattmann
Priority: Minor
  Labels: new-parser
 Fix For: 1.8

 Attachments: TIKA-1580.Mattmann.Totaro.032515.patch.txt, 
 TIKA-1580.patch, TIKA-1580.v02.patch, 
 TIKA-1580.v03.2.Mattmann.Totaro.03262015.patch


 We are going to add parsers for ISA-Tab data formats.
 ISA-Tab files are related to [ISA Tools|http://www.isa-tools.org/], which help 
 to manage an increasingly diverse set of life science, environmental, and 
 biomedical experiments that employ one or a combination of technologies.
 The ISA tools are built upon an _Investigation_, _Study_, and _Assay_ tabular 
 format. Therefore, the ISA-Tab data format includes three types of file: 
 Investigation file ({{i_.txt}}), Study file ({{s_.txt}}), and Assay file 
 ({{a_.txt}}). These files are organized as a [top-down 
 hierarchy|http://www.isa-tools.org/format/specification/]: an Investigation 
 file includes one or more Study files; each Study file includes one or more 
 Assay files.
 Essentially, the Investigation file contains high-level information about 
 the related study, so it provides only metadata about the ISA-Tab files.
 More details on the file format specification are [available 
 online|http://isatab.sourceforge.net/docs/ISA-TAB_release-candidate-1_v1.0_24nov08.pdf].
 The attached patch provides a preliminary version of the ISA-Tab parsers 
 (there are three parsers, one for each ISA-Tab filetype):
 * {{ISATabInvestigationParser.java}}: parses Investigation files. It extracts 
 only metadata.
 * {{ISATabStudyParser.java}}: parses Study files.
 * {{ISATabAssayParser.java}}: parses Assay files.
 The most important planned improvements are:
 * Combine these three parsers in order to parse an ISArchive
 * Provide a better mapping of both study and assay data to XHTML. Currently, 
 {{ISATabStudyParser}} and {{ISATabAssayParser}} provide a naive mapping 
 function relying on [Apache Commons 
 CSV|https://commons.apache.org/proper/commons-csv/].
 Thanks for supporting me on this work, [~chrismattmann]. 
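
As a usage illustration only (assuming the class names from the patch under review 
and the standard Tika Parser API; note that the committed version in r1669839 
reorganizes this functionality into {{ISArchiveParser}}), parsing an Investigation 
file with the preliminary parser would look roughly like:

{code}
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.isatab.ISATabInvestigationParser;
import org.apache.tika.sax.BodyContentHandler;

public class ISATabExample {
    public static void main(String[] args) throws Exception {
        // The Investigation parser from the patch extracts metadata only.
        Parser parser = new ISATabInvestigationParser();
        Metadata metadata = new Metadata();
        BodyContentHandler handler = new BodyContentHandler();

        try (InputStream stream = new FileInputStream("i_investigation.txt")) {
            parser.parse(stream, handler, metadata, new ParseContext());
        }

        // Print whatever metadata the parser pulled from the Investigation file.
        for (String name : metadata.names()) {
            System.out.println(name + " = " + metadata.get(name));
        }
    }
}
{code}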



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (TIKA-1580) ISA-Tab parsers

2015-03-28 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved TIKA-1580.
-
Resolution: Fixed

 ISA-Tab parsers
 ---

 Key: TIKA-1580
 URL: https://issues.apache.org/jira/browse/TIKA-1580
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Giuseppe Totaro
Assignee: Chris A. Mattmann
Priority: Minor
  Labels: new-parser
 Fix For: 1.8

 Attachments: TIKA-1580.Mattmann.Totaro.032515.patch.txt, 
 TIKA-1580.patch, TIKA-1580.v02.patch, 
 TIKA-1580.v03.2.Mattmann.Totaro.03262015.patch


 We are going to add parsers for ISA-Tab data formats.
 ISA-Tab files are related to [ISA Tools|http://www.isa-tools.org/], which help 
 to manage an increasingly diverse set of life science, environmental, and 
 biomedical experiments that employ one or a combination of technologies.
 The ISA tools are built upon an _Investigation_, _Study_, and _Assay_ tabular 
 format. Therefore, the ISA-Tab data format includes three types of file: 
 Investigation file ({{i_.txt}}), Study file ({{s_.txt}}), and Assay file 
 ({{a_.txt}}). These files are organized as a [top-down 
 hierarchy|http://www.isa-tools.org/format/specification/]: an Investigation 
 file includes one or more Study files; each Study file includes one or more 
 Assay files.
 Essentially, the Investigation file contains high-level information about 
 the related study, so it provides only metadata about the ISA-Tab files.
 More details on the file format specification are [available 
 online|http://isatab.sourceforge.net/docs/ISA-TAB_release-candidate-1_v1.0_24nov08.pdf].
 The attached patch provides a preliminary version of the ISA-Tab parsers 
 (there are three parsers, one for each ISA-Tab filetype):
 * {{ISATabInvestigationParser.java}}: parses Investigation files. It extracts 
 only metadata.
 * {{ISATabStudyParser.java}}: parses Study files.
 * {{ISATabAssayParser.java}}: parses Assay files.
 The most important planned improvements are:
 * Combine these three parsers in order to parse an ISArchive
 * Provide a better mapping of both study and assay data to XHTML. Currently, 
 {{ISATabStudyParser}} and {{ISATabAssayParser}} provide a naive mapping 
 function relying on [Apache Commons 
 CSV|https://commons.apache.org/proper/commons-csv/].
 Thanks for supporting me on this work, [~chrismattmann]. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)

2015-03-28 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385463#comment-14385463
 ] 

Tim Allison edited comment on TIKA-1584 at 3/28/15 6:53 PM:


Just checked svn. That's a major regression added in 1.7 when we added 
specification of ParseContext. We need to add the Parser to the ParseContext to 
get recursive parsing. W/o use of ParseContext in call to parse, the parser 
used to work recursively. Will fix Monday unless someone beats me to it. Thank 
you for raising this. No need to attach test doc.


was (Author: talli...@mitre.org):
Just checked svn. That's a major regression added in 1.7 when we added 
specification of ParseContext. We need to add the Parser to the ParseContext to 
get recursive parsing. W/o use of ParseContext in call to parse, the parser 
works recursively. Will fix Monday unless someone beats me to it. Thank you for 
raising this. No need to attach test doc.

 Tika 1.7 possible regression (nested attachment files not getting parsed)
 -

 Key: TIKA-1584
 URL: https://issues.apache.org/jira/browse/TIKA-1584
 Project: Tika
  Issue Type: Bug
  Components: server
Affects Versions: 1.7
Reporter: Rob Tulloh
Priority: Blocker

 I tried to send this to the tika user list, but got a qmail failure so I am 
 opening a jira to see if I can get help with this.
 There appears to be a change in the behavior of tika since 1.5 (the last 
 version we have used). In 1.5, if we pass a file with content type of rfc822 
 which contains a zip that contains a docx file, the entire content would get 
 recursed and the text returned. In 1.7, tika only unwinds as far as the zip 
 file and ignores the content of the contained docx file. This is causing a 
 regression failure in our search tests because the contents of the docx file 
 are not found when searched for.
  
 We are testing with tika-server if this helps. If we ask the meta service to 
 just characterize the test data, it correctly determines the input is of type 
 rfc822. However, on extract, the contents of the attachment are not extracted 
 as expected.
 curl -X PUT -T test.eml -q -H "Content-Type:application/octet-stream" http://localhost:9998/meta 2>/dev/null | grep Content-Type
 Content-Type,message/rfc822
 curl -X PUT -T test.eml -q -H "Content-Type:application/octet-stream" http://localhost:9998/tika 2>/dev/null | grep docx
 sign.docx   <--- this is not expected, need contents of this extracted
 We can easily reproduce this problem with a simple eml file with an 
 attachment. Can someone please comment if this seems like a problem or 
 perhaps we need to change something in our call to get the old behavior?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)

2015-03-28 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385463#comment-14385463
 ] 

Tim Allison commented on TIKA-1584:
---

Just checked svn. That's a major regression added in 1.7 when we added 
specification of ParseContext. We need to add the Parser to the ParseContext to 
get recursive parsing. W/o use of ParseContext in call to parse, the parser 
works recursively. Will fix Monday unless someone beats me to it. Thank you for 
raising this. No need to attach test doc.
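
For context, a minimal sketch of what "adding the Parser to the ParseContext" looks 
like with Tika's Java API (an illustration only, not the eventual fix; the file name 
test.eml just mirrors the reporter's example):

{code}
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

public class RecursiveParseExample {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        ParseContext context = new ParseContext();
        // Registering the parser in the context is what enables recursion:
        // container parsers (rfc822, zip, ...) look up Parser.class here
        // when they hit an embedded document.
        context.set(Parser.class, parser);

        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();
        try (InputStream stream = new FileInputStream("test.eml")) {
            parser.parse(stream, handler, metadata, context);
        }
        System.out.println(handler.toString());
    }
}
{code}

Without the {{context.set(Parser.class, parser)}} line, embedded documents are 
detected but their content is not parsed, which matches the behavior reported above.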

 Tika 1.7 possible regression (nested attachment files not getting parsed)
 -

 Key: TIKA-1584
 URL: https://issues.apache.org/jira/browse/TIKA-1584
 Project: Tika
  Issue Type: Bug
  Components: server
Affects Versions: 1.7
Reporter: Rob Tulloh

 I tried to send this to the tika user list, but got a qmail failure so I am 
 opening a jira to see if I can get help with this.
 There appears to be a change in the behavior of tika since 1.5 (the last 
 version we have used). In 1.5, if we pass a file with content type of rfc822 
 which contains a zip that contains a docx file, the entire content would get 
 recursed and the text returned. In 1.7, tika only unwinds as far as the zip 
 file and ignores the content of the contained docx file. This is causing a 
 regression failure in our search tests because the contents of the docx file 
 are not found when searched for.
  
 We are testing with tika-server if this helps. If we ask the meta service to 
 just characterize the test data, it correctly determines the input is of type 
 rfc822. However, on extract, the contents of the attachment are not extracted 
 as expected.
 curl -X PUT -T test.eml -q -H "Content-Type:application/octet-stream" http://localhost:9998/meta 2>/dev/null | grep Content-Type
 Content-Type,message/rfc822
 curl -X PUT -T test.eml -q -H "Content-Type:application/octet-stream" http://localhost:9998/tika 2>/dev/null | grep docx
 sign.docx   <--- this is not expected, need contents of this extracted
 We can easily reproduce this problem with a simple eml file with an 
 attachment. Can someone please comment if this seems like a problem or 
 perhaps we need to change something in our call to get the old behavior?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)

2015-03-28 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385483#comment-14385483
 ] 

Tyler Palsulich commented on TIKA-1584:
---

We now have two major issues which need a quick release. So, I would say go for 
1.8. Tim, can you chime in on the current discuss thread?

 Tika 1.7 possible regression (nested attachment files not getting parsed)
 -

 Key: TIKA-1584
 URL: https://issues.apache.org/jira/browse/TIKA-1584
 Project: Tika
  Issue Type: Bug
  Components: server
Affects Versions: 1.7
Reporter: Rob Tulloh
Assignee: Tim Allison
Priority: Blocker

 I tried to send this to the tika user list, but got a qmail failure so I am 
 opening a jira to see if I can get help with this.
 There appears to be a change in the behavior of tika since 1.5 (the last 
 version we have used). In 1.5, if we pass a file with content type of rfc822 
 which contains a zip that contains a docx file, the entire content would get 
 recursed and the text returned. In 1.7, tika only unwinds as far as the zip 
 file and ignores the content of the contained docx file. This is causing a 
 regression failure in our search tests because the contents of the docx file 
 are not found when searched for.
  
 We are testing with tika-server if this helps. If we ask the meta service to 
 just characterize the test data, it correctly determines the input is of type 
 rfc822. However, on extract, the contents of the attachment are not extracted 
 as expected.
 curl -X PUT -T test.eml -q -H "Content-Type:application/octet-stream" http://localhost:9998/meta 2>/dev/null | grep Content-Type
 Content-Type,message/rfc822
 curl -X PUT -T test.eml -q -H "Content-Type:application/octet-stream" http://localhost:9998/tika 2>/dev/null | grep docx
 sign.docx   <--- this is not expected, need contents of this extracted
 We can easily reproduce this problem with a simple eml file with an 
 attachment. Can someone please comment if this seems like a problem or 
 perhaps we need to change something in our call to get the old behavior?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1580) ISA-Tab parsers

2015-03-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385558#comment-14385558
 ] 

Hudson commented on TIKA-1580:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #579 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/579/])
Fix for TIKA-1580: Support IsaTab MIME identification and parsing. Thanks to 
Giuseppe Totaro for all the great work! (mattmann: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1669839)
* /tika/trunk/CHANGES.txt
* /tika/trunk/tika-bundle/pom.xml
* 
/tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
* /tika/trunk/tika-parsers/pom.xml
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/isatab
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/isatab/ISATabUtils.java
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/isatab/ISArchiveParser.java
* 
/tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/isatab
* 
/tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/isatab/ISArchiveParserTest.java
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1
* 
/tika/trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_bii-s-2_metabolite
 profiling_NMR spectroscopy.txt
* 
/tika/trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_metabolome.txt
* 
/tika/trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_microarray.txt
* 
/tika/trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_proteome.txt
* 
/tika/trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_transcriptome.txt
* 
/tika/trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/i_investigation.txt
* 
/tika/trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/s_BII-S-1.txt
* 
/tika/trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/s_BII-S-2.txt


 ISA-Tab parsers
 ---

 Key: TIKA-1580
 URL: https://issues.apache.org/jira/browse/TIKA-1580
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Giuseppe Totaro
Assignee: Chris A. Mattmann
Priority: Minor
  Labels: new-parser
 Fix For: 1.8

 Attachments: TIKA-1580.Mattmann.Totaro.032515.patch.txt, 
 TIKA-1580.patch, TIKA-1580.v02.patch, 
 TIKA-1580.v03.2.Mattmann.Totaro.03262015.patch


 We are going to add parsers for ISA-Tab data formats.
 ISA-Tab files are related to [ISA Tools|http://www.isa-tools.org/], which help 
 to manage an increasingly diverse set of life science, environmental, and 
 biomedical experiments that employ one or a combination of technologies.
 The ISA tools are built upon an _Investigation_, _Study_, and _Assay_ tabular 
 format. Therefore, the ISA-Tab data format includes three types of file: 
 Investigation file ({{i_.txt}}), Study file ({{s_.txt}}), and Assay file 
 ({{a_.txt}}). These files are organized as a [top-down 
 hierarchy|http://www.isa-tools.org/format/specification/]: an Investigation 
 file includes one or more Study files; each Study file includes one or more 
 Assay files.
 Essentially, the Investigation file contains high-level information about 
 the related study, so it provides only metadata about the ISA-Tab files.
 More details on the file format specification are [available 
 online|http://isatab.sourceforge.net/docs/ISA-TAB_release-candidate-1_v1.0_24nov08.pdf].
 The attached patch provides a preliminary version of the ISA-Tab parsers 
 (there are three parsers, one for each ISA-Tab filetype):
 * {{ISATabInvestigationParser.java}}: parses Investigation files. It extracts 
 only metadata.
 * {{ISATabStudyParser.java}}: parses Study files.
 * {{ISATabAssayParser.java}}: parses Assay files.
 The most important planned improvements are:
 * Combine these three parsers in order to parse an ISArchive
 * Provide a better mapping of both study and assay data to XHTML. Currently, 
 {{ISATabStudyParser}} and {{ISATabAssayParser}} provide a naive mapping 
 function relying on [Apache Commons 
 CSV|https://commons.apache.org/proper/commons-csv/].
 Thanks for supporting me on this work, [~chrismattmann]. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)

2015-03-28 Thread Rob Tulloh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385472#comment-14385472
 ] 

Rob Tulloh commented on TIKA-1584:
--

Thank you. For what it's worth, it is easy to reproduce. Just zip any document you 
want, pass the zip file to tika-server, and see what it gives back. As 
1.7 is released, does this mean that this won't be fixed until 1.8, or would 1.7 
get re-released/patched?

 Tika 1.7 possible regression (nested attachment files not getting parsed)
 -

 Key: TIKA-1584
 URL: https://issues.apache.org/jira/browse/TIKA-1584
 Project: Tika
  Issue Type: Bug
  Components: server
Affects Versions: 1.7
Reporter: Rob Tulloh
Priority: Blocker

 I tried to send this to the tika user list, but got a qmail failure so I am 
 opening a jira to see if I can get help with this.
 There appears to be a change in the behavior of tika since 1.5 (the last 
 version we have used). In 1.5, if we pass a file with content type of rfc822 
 which contains a zip that contains a docx file, the entire content would get 
 recursed and the text returned. In 1.7, tika only unwinds as far as the zip 
 file and ignores the content of the contained docx file. This is causing a 
 regression failure in our search tests because the contents of the docx file 
 are not found when searched for.
  
 We are testing with tika-server if this helps. If we ask the meta service to 
 just characterize the test data, it correctly determines the input is of type 
 rfc822. However, on extract, the contents of the attachment are not extracted 
 as expected.
 curl -X PUT -T test.eml -q -H "Content-Type:application/octet-stream" http://localhost:9998/meta 2>/dev/null | grep Content-Type
 Content-Type,message/rfc822
 curl -X PUT -T test.eml -q -H "Content-Type:application/octet-stream" http://localhost:9998/tika 2>/dev/null | grep docx
 sign.docx   <--- this is not expected, need contents of this extracted
 We can easily reproduce this problem with a simple eml file with an 
 attachment. Can someone please comment if this seems like a problem or 
 perhaps we need to change something in our call to get the old behavior?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1580) ISA-Tab parsers

2015-03-28 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385540#comment-14385540
 ] 

Chris A. Mattmann commented on TIKA-1580:
-

Built and unit tested successfully. Deployed in production on a bioinformatics 
project. Works great!
Committing now.
{noformat}
[INFO] 
[INFO] Reactor Summary:
[INFO] 
[INFO] Apache Tika parent . SUCCESS [  1.994 s]
[INFO] Apache Tika core ... SUCCESS [ 19.033 s]
[INFO] Apache Tika parsers  SUCCESS [02:29 min]
[INFO] Apache Tika XMP  SUCCESS [  2.675 s]
[INFO] Apache Tika serialization .. SUCCESS [  2.082 s]
[INFO] Apache Tika batch .. SUCCESS [01:57 min]
[INFO] Apache Tika application  SUCCESS [ 13.227 s]
[INFO] Apache Tika OSGi bundle  SUCCESS [ 18.103 s]
[INFO] Apache Tika server . SUCCESS [ 22.457 s]
[INFO] Apache Tika translate .. SUCCESS [  3.347 s]
[INFO] Apache Tika examples ... SUCCESS [  6.103 s]
[INFO] Apache Tika Java-7 Components .. SUCCESS [  2.039 s]
[INFO] Apache Tika  SUCCESS [  0.030 s]
[INFO] 
[INFO] BUILD SUCCESS
[INFO] 
[INFO] Total time: 05:58 min
[INFO] Finished at: 2015-03-28T14:45:20-07:00
[INFO] Final Memory: 103M/1592M
[INFO] 
[chipotle:~/tmp/tika] mattmann% 
{noformat}

 ISA-Tab parsers
 ---

 Key: TIKA-1580
 URL: https://issues.apache.org/jira/browse/TIKA-1580
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Giuseppe Totaro
Assignee: Chris A. Mattmann
Priority: Minor
  Labels: new-parser
 Fix For: 1.8

 Attachments: TIKA-1580.Mattmann.Totaro.032515.patch.txt, 
 TIKA-1580.patch, TIKA-1580.v02.patch, 
 TIKA-1580.v03.2.Mattmann.Totaro.03262015.patch


 We are going to add parsers for ISA-Tab data formats.
 ISA-Tab files are related to [ISA Tools|http://www.isa-tools.org/], which help 
 to manage an increasingly diverse set of life science, environmental, and 
 biomedical experiments that employ one or a combination of technologies.
 The ISA tools are built upon an _Investigation_, _Study_, and _Assay_ tabular 
 format. Therefore, the ISA-Tab data format includes three types of file: 
 Investigation file ({{i_.txt}}), Study file ({{s_.txt}}), and Assay file 
 ({{a_.txt}}). These files are organized as a [top-down 
 hierarchy|http://www.isa-tools.org/format/specification/]: an Investigation 
 file includes one or more Study files; each Study file includes one or more 
 Assay files.
 Essentially, the Investigation file contains high-level information about 
 the related study, so it provides only metadata about the ISA-Tab files.
 More details on the file format specification are [available 
 online|http://isatab.sourceforge.net/docs/ISA-TAB_release-candidate-1_v1.0_24nov08.pdf].
 The attached patch provides a preliminary version of the ISA-Tab parsers 
 (there are three parsers, one for each ISA-Tab filetype):
 * {{ISATabInvestigationParser.java}}: parses Investigation files. It extracts 
 only metadata.
 * {{ISATabStudyParser.java}}: parses Study files.
 * {{ISATabAssayParser.java}}: parses Assay files.
 The most important planned improvements are:
 * Combine these three parsers in order to parse an ISArchive
 * Provide a better mapping of both study and assay data to XHTML. Currently, 
 {{ISATabStudyParser}} and {{ISATabAssayParser}} provide a naive mapping 
 function relying on [Apache Commons 
 CSV|https://commons.apache.org/proper/commons-csv/].
 Thanks for supporting me on this work, [~chrismattmann]. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)

2015-03-28 Thread Rob Tulloh (JIRA)
Rob Tulloh created TIKA-1584:


 Summary: Tika 1.7 possible regression (nested attachment files not 
getting parsed)
 Key: TIKA-1584
 URL: https://issues.apache.org/jira/browse/TIKA-1584
 Project: Tika
  Issue Type: Bug
  Components: server
Affects Versions: 1.7
Reporter: Rob Tulloh


I tried to send this to the tika user list, but got a qmail failure so I am 
opening a jira to see if I can get help with this.

There appears to be a change in the behavior of tika since 1.5 (the last 
version we have used). In 1.5, if we pass a file with content type of rfc822 
which contains a zip that contains a docx file, the entire content would get 
recursed and the text returned. In 1.7, tika only unwinds as far as the zip 
file and ignores the content of the contained docx file. This is causing a 
regression failure in our search tests because the contents of the docx file 
are not found when searched for.
 
We are testing with tika-server if this helps. If we ask the meta service to 
just characterize the test data, it correctly determines the input is of type 
rfc822. However, on extract, the contents of the attachment are not extracted 
as expected.

curl -X PUT -T test.eml -q -H "Content-Type:application/octet-stream" http://localhost:9998/meta 2>/dev/null | grep Content-Type
Content-Type,message/rfc822

curl -X PUT -T test.eml -q -H "Content-Type:application/octet-stream" http://localhost:9998/tika 2>/dev/null | grep docx
sign.docx   <--- this is not expected, need contents of this extracted


We can easily reproduce this problem with a simple eml file with an attachment. 
Can someone please comment if this seems like a problem or perhaps we need to 
change something in our call to get the old behavior?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TIKA-1585) Create Example Website with Form Submission

2015-03-28 Thread Tyler Palsulich (JIRA)
Tyler Palsulich created TIKA-1585:
-

 Summary: Create Example Website with Form Submission
 Key: TIKA-1585
 URL: https://issues.apache.org/jira/browse/TIKA-1585
 Project: Tika
  Issue Type: New Feature
  Components: example, server
Reporter: Tyler Palsulich
Assignee: Tyler Palsulich


It would be great to have a website where we can direct people who ask what 
Tika can do for [filetype] without needing them to actually download Tika.

Some initial work to do that is 
[here|http://tpalsulich.github.io/TikaExamples/].

I'm far from a design guru, but I imagine the site as having a form where you 
can upload a file at the top, checkboxes for whether you want metadata, content, or 
both, and a submit button. The request should be sent with AJAX, and the result 
should populate a {{div}}.

One issue with AJAX requests is that Tika Server doesn't currently allow 
Cross-Origin-Resource-Sharing (CORS). So, we either need to maintain a slightly 
updated tika-server, or update the server to allow configuration.
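
Purely as an illustration of the "update the server to allow configuration" option 
(a hypothetical sketch, assuming CXF's {{CrossOriginResourceSharingFilter}} from 
{{cxf-rt-rs-security-cors}}; none of this wiring exists in tika-server today):

{code}
import java.util.Arrays;

import org.apache.cxf.jaxrs.JAXRSServerFactoryBean;
import org.apache.cxf.rs.security.cors.CrossOriginResourceSharingFilter;

public class CorsServerSketch {
    public static void main(String[] args) {
        // Let the example site's origin call the /tika and /meta endpoints.
        CrossOriginResourceSharingFilter cors = new CrossOriginResourceSharingFilter();
        cors.setAllowOrigins(Arrays.asList("http://tpalsulich.github.io"));
        cors.setAllowCredentials(false);

        JAXRSServerFactoryBean sf = new JAXRSServerFactoryBean();
        sf.setAddress("http://localhost:9998/");
        // ... register the usual tika-server resource classes here ...
        sf.setProviders(Arrays.<Object>asList(cors));
        sf.create();
    }
}
{code}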



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)

2015-03-28 Thread Rob Tulloh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385328#comment-14385328
 ] 

Rob Tulloh commented on TIKA-1584:
--

If the .zip file is passed to tika, it shows the same behavior.

{noformat}
curl -X PUT -T sign.zip -H "Content-Type:application/octet-stream" http://localhost:9998/tika 2>/dev/null

sign.docx

{noformat}

 Tika 1.7 possible regression (nested attachment files not getting parsed)
 -

 Key: TIKA-1584
 URL: https://issues.apache.org/jira/browse/TIKA-1584
 Project: Tika
  Issue Type: Bug
  Components: server
Affects Versions: 1.7
Reporter: Rob Tulloh

 I tried to send this to the tika user list, but got a qmail failure so I am 
 opening a jira to see if I can get help with this.
 There appears to be a change in the behavior of tika since 1.5 (the last 
 version we have used). In 1.5, if we pass a file with content type of rfc822 
 which contains a zip that contains a docx file, the entire content would get 
 recursed and the text returned. In 1.7, tika only unwinds as far as the zip 
 file and ignores the content of the contained docx file. This is causing a 
 regression failure in our search tests because the contents of the docx file 
 are not found when searched for.
  
 We are testing with tika-server if this helps. If we ask the meta service to 
 just characterize the test data, it correctly determines the input is of type 
 rfc822. However, on extract, the contents of the attachment are not extracted 
 as expected.
 curl -X PUT -T test.eml -q -H "Content-Type:application/octet-stream" http://localhost:9998/meta 2>/dev/null | grep Content-Type
 Content-Type,message/rfc822
 curl -X PUT -T test.eml -q -H "Content-Type:application/octet-stream" http://localhost:9998/tika 2>/dev/null | grep docx
 sign.docx   <--- this is not expected, need contents of this extracted
 We can easily reproduce this problem with a simple eml file with an 
 attachment. Can someone please comment if this seems like a problem or 
 perhaps we need to change something in our call to get the old behavior?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1581) jhighlight license concerns

2015-03-28 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385331#comment-14385331
 ] 

Ken Krugler commented on TIKA-1581:
---

Hi Tyler,

JHighlight has been updated in Central, and Tika is now using that version.

So I believe it's resolved, as long as the changes that Hong-Thai made to the 
NOTICE.txt are sufficient for the CDDL license used by jhighlight.

And yes, as per my comment above we'll need to release a new version of Tika 
for downstream libraries. Seems like it could be worth a quick dot release for 
ManifoldCF/Lucene.


 jhighlight license concerns
 ---

 Key: TIKA-1581
 URL: https://issues.apache.org/jira/browse/TIKA-1581
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.7
Reporter: Karl Wright
 Fix For: 1.8


 jhighlight jar is a Tika dependency.  The Lucene team discovered that, while 
 it claims to be a CDDL/LGPL dual-license, some of its functionality is LGPL 
 only:
 {code}
 Solr's contrib/extraction contains jhighlight-1.0.jar which declares itself 
 as dual CDDL or LGPL license. However, some of its classes are distributed 
 only under LGPL, e.g.
 com.uwyn.jhighlight.highlighter.
   CppHighlighter.java
   GroovyHighlighter.java
   JavaHighlighter.java
   XmlHighlighter.java
 I downloaded the sources from Maven 
 (http://search.maven.org/remotecontent?filepath=com/uwyn/jhighlight/1.0/jhighlight-1.0-sources.jar)
  to confirm that, and also found this SVN repo: 
 http://svn.rifers.org/jhighlight/tags/release-1.0, though the project's 
 website seems to not exist anymore (https://jhighlight.dev.java.net/).
 I didn't find any direct usage of it in our code, so I guess it's probably 
 needed by a 3rd party dependency, such as Tika. Therefore if we e.g. omit it, 
 things will compile, but may fail at runtime.
 {code}
 Is it possible to remove this dependency for future releases, or allow only 
 optional inclusion of this package?  It is of concern to the ManifoldCF 
 project because we distribute a binary package that includes Tika and its 
 required dependencies, which currently includes jHighlight.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1581) jhighlight license concerns

2015-03-28 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385346#comment-14385346
 ] 

Ken Krugler commented on TIKA-1581:
---

Based on what I see in other projects (e.g. the Lucene NOTICE.txt file) this 
seems to be following standard practices, so I'm going to assume it's OK.

 jhighlight license concerns
 ---

 Key: TIKA-1581
 URL: https://issues.apache.org/jira/browse/TIKA-1581
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.7
Reporter: Karl Wright
 Fix For: 1.8


 jhighlight jar is a Tika dependency.  The Lucene team discovered that, while 
 it claims to be a CDDL/LGPL dual-license, some of its functionality is LGPL 
 only:
 {code}
 Solr's contrib/extraction contains jhighlight-1.0.jar which declares itself 
 as dual CDDL or LGPL license. However, some of its classes are distributed 
 only under LGPL, e.g.
 com.uwyn.jhighlight.highlighter.
   CppHighlighter.java
   GroovyHighlighter.java
   JavaHighlighter.java
   XmlHighlighter.java
 I downloaded the sources from Maven 
 (http://search.maven.org/remotecontent?filepath=com/uwyn/jhighlight/1.0/jhighlight-1.0-sources.jar)
  to confirm that, and also found this SVN repo: 
 http://svn.rifers.org/jhighlight/tags/release-1.0, though the project's 
 website seems to not exist anymore (https://jhighlight.dev.java.net/).
 I didn't find any direct usage of it in our code, so I guess it's probably 
 needed by a 3rd party dependency, such as Tika. Therefore if we e.g. omit it, 
 things will compile, but may fail at runtime.
 {code}
 Is it possible to remove this dependency for future releases, or allow only 
 optional inclusion of this package?  It is of concern to the ManifoldCF 
 project because we distribute a binary package that includes Tika and its 
 required dependencies, which currently includes jHighlight.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (TIKA-1526) ExternalParser should trap/ignore/workaround JDK-8047340 & JDK-8055301 so Turkish Tika users can still use non-external parsers

2015-03-28 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich resolved TIKA-1526.
---
Resolution: Fixed

Marking this as Fixed, per the above comments. [~thetaphi] or [~hossman] or 
anyone else, please reopen this if you find any other cases.

Thank you everyone for the help!

 ExternalParser should trap/ignore/workaround JDK-8047340 & JDK-8055301 so 
 Turkish Tika users can still use non-external parsers
 

 Key: TIKA-1526
 URL: https://issues.apache.org/jira/browse/TIKA-1526
 Project: Tika
  Issue Type: Wish
Reporter: Hoss Man

 the JDK has numerous pain points regarding the Turkish locale, posix_spawn 
 lowercasing being one of them...
 https://bugs.openjdk.java.net/browse/JDK-8047340
 https://bugs.openjdk.java.net/browse/JDK-8055301
 As of Tika 1.7, the TesseractOCRParser (which is an ExternalParser) is 
 enabled & configured by default in Tika, and uses ExternalParser.check to see 
 if tesseract is available -- but because of the JDK bug, this means that Tika 
 fails fast for Turkish users on BSD/UNIX variants (including MacOSX) like 
 so...
 {noformat}
   [junit4] Throwable #1: java.lang.Error: posix_spawn is not a supported 
 process launch mechanism on this platform.
   [junit4]   at java.lang.UNIXProcess$1.run(UNIXProcess.java:105)
   [junit4]   at java.lang.UNIXProcess$1.run(UNIXProcess.java:94)
   [junit4]   at java.security.AccessController.doPrivileged(Native 
 Method)
   [junit4]   at java.lang.UNIXProcess.<clinit>(UNIXProcess.java:92)
   [junit4]   at java.lang.ProcessImpl.start(ProcessImpl.java:130)
   [junit4]   at 
 java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
   [junit4]   at java.lang.Runtime.exec(Runtime.java:620)
   [junit4]   at java.lang.Runtime.exec(Runtime.java:485)
   [junit4]   at 
 org.apache.tika.parser.external.ExternalParser.check(ExternalParser.java:344)
   [junit4]   at 
 org.apache.tika.parser.ocr.TesseractOCRParser.hasTesseract(TesseractOCRParser.java:117)
   [junit4]   at 
 org.apache.tika.parser.ocr.TesseractOCRParser.getSupportedTypes(TesseractOCRParser.java:90)
   [junit4]   at 
 org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
   [junit4]   at 
 org.apache.tika.parser.DefaultParser.getParsers(DefaultParser.java:95)
   [junit4]   at 
 org.apache.tika.parser.CompositeParser.getSupportedTypes(CompositeParser.java:229)
   [junit4]   at 
 org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
   [junit4]   at 
 org.apache.tika.parser.CompositeParser.getParser(CompositeParser.java:209)
   [junit4]   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
   [junit4]   at 
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
 {noformat}
 ...unless they go out of their way to whitelist only the parsers they 
 need/want so TesseractOCRParser (and any other ExternalParsers) will never 
 even be check()ed.
 It would be nice if Tika's ExternalParser class added a similar 
 hack/workaround to what was done in SOLR-6387 to trap these types of errors. 
 In Solr we just propagate a better error explaining why Java hates the 
 Turkish language...
 {code}
 } catch (Error err) {
   if (err.getMessage() != null && (err.getMessage().contains("posix_spawn") || err.getMessage().contains("UNIXProcess"))) {
     log.warn("Error forking command due to JVM locale bug (see https://issues.apache.org/jira/browse/SOLR-6387): " + err.getMessage());
     return "(error executing: " + cmd + ")";
   }
 }
 {code}
 ...but with Tika, it might be better for all ExternalParsers to just opt 
 out as if they don't recognize the filetype when they detect this type of 
 error from the check method (or perhaps it would be better if 
 AutoDetectParser handled this? ... I'm not really sure how it would best fit 
 into Tika's architecture)
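
 A rough sketch of that idea, assuming a made-up Probe interface and safeCheck 
 helper (only the posix_spawn/UNIXProcess message test is taken from the 
 SOLR-6387 workaround above; this is not the fix that was actually committed):
 {code}
public class ExternalCheckGuard {

    /** Stand-in for whatever ExternalParser.check(...)-style call is being guarded. */
    interface Probe {
        boolean run();
    }

    /** Treat the JVM locale bug as "external tool not available" instead of failing. */
    static boolean safeCheck(Probe probe) {
        try {
            return probe.run();
        } catch (Error err) {
            String msg = err.getMessage();
            if (msg != null && (msg.contains("posix_spawn") || msg.contains("UNIXProcess"))) {
                // JDK-8047340 / JDK-8055301: report the external tool as absent.
                return false;
            }
            throw err;
        }
    }
}
 {code}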



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1581) jhighlight license concerns

2015-03-28 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385337#comment-14385337
 ] 

Tyler Palsulich commented on TIKA-1581:
---

Hi [~kkrugler]. Thanks. The comment is now
bq. Tika-parsers component uses CDDL/LGPL dual-licensed dependency: jhighlight 
(https://github.com/codelibs/jhighlight)

If this looks good, I'll start a \[DISCUSS\] thread on the list about a new 
version.

 jhighlight license concerns
 ---

 Key: TIKA-1581
 URL: https://issues.apache.org/jira/browse/TIKA-1581
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.7
Reporter: Karl Wright
 Fix For: 1.8


 jhighlight jar is a Tika dependency.  The Lucene team discovered that, while 
 it claims to be a CDDL/LGPL dual-license, some of its functionality is LGPL 
 only:
 {code}
 Solr's contrib/extraction contains jhighlight-1.0.jar which declares itself 
 as dual CDDL or LGPL license. However, some of its classes are distributed 
 only under LGPL, e.g.
 com.uwyn.jhighlight.highlighter.
   CppHighlighter.java
   GroovyHighlighter.java
   JavaHighlighter.java
   XmlHighlighter.java
 I downloaded the sources from Maven 
 (http://search.maven.org/remotecontent?filepath=com/uwyn/jhighlight/1.0/jhighlight-1.0-sources.jar)
  to confirm that, and also found this SVN repo: 
 http://svn.rifers.org/jhighlight/tags/release-1.0, though the project's 
 website seems to not exist anymore (https://jhighlight.dev.java.net/).
 I didn't find any direct usage of it in our code, so I guess it's probably 
 needed by a 3rd party dependency, such as Tika. Therefore if we e.g. omit it, 
 things will compile, but may fail at runtime.
 {code}
 Is it possible to remove this dependency for future releases, or allow only 
 optional inclusion of this package?  It is of concern to the ManifoldCF 
 project because we distribute a binary package that includes Tika and its 
 required dependencies, which currently includes jHighlight.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TIKA-1586) Enable CORS on Tika Server

2015-03-28 Thread Tyler Palsulich (JIRA)
Tyler Palsulich created TIKA-1586:
-

 Summary: Enable CORS on Tika Server
 Key: TIKA-1586
 URL: https://issues.apache.org/jira/browse/TIKA-1586
 Project: Tika
  Issue Type: New Feature
  Components: server
Reporter: Tyler Palsulich
Assignee: Tyler Palsulich


Tika Server should allow configuration of CORS requests (for uses like 
TIKA-1585). See [this example|http://cxf.apache.org/docs/jax-rs-cors.html] from 
CXF for how to add it.

The only change from that site is that we will need to add a 
{{CrossOriginResourceSharingFilter}} as a provider.

Ideally, this is configurable (limit which resources have CORS, and which 
origins are allowed). But, I'm not thinking of any general methods of how to do 
that...
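
 For reference, a minimal sketch of wiring that up, assuming the 
 cxf-rt-rs-security-cors module is on the classpath (the addCors helper and 
 the single-origin configuration are just an illustration, not the actual 
 tika-server change):
 {code}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.cxf.jaxrs.JAXRSServerFactoryBean;
import org.apache.cxf.rs.security.cors.CrossOriginResourceSharingFilter;

public class CorsSetupSketch {

    // "allowedOrigin" is a placeholder for whatever a --cors style option would carry.
    static void addCors(JAXRSServerFactoryBean sf, String allowedOrigin) {
        CrossOriginResourceSharingFilter corsFilter = new CrossOriginResourceSharingFilter();
        corsFilter.setAllowOrigins(Collections.singletonList(allowedOrigin));

        List<Object> providers = new ArrayList<Object>();
        providers.add(corsFilter);
        sf.setProviders(providers);
    }
}
 {code}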



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] tika pull request: TIKA-1586. Enable CORS requests on Tika server

2015-03-28 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/37


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (TIKA-1586) Enable CORS on Tika Server

2015-03-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385398#comment-14385398
 ] 

ASF GitHub Bot commented on TIKA-1586:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/37


 Enable CORS on Tika Server
 --

 Key: TIKA-1586
 URL: https://issues.apache.org/jira/browse/TIKA-1586
 Project: Tika
  Issue Type: New Feature
  Components: server
Reporter: Tyler Palsulich
Assignee: Tyler Palsulich

 Tika Server should allow configuration of CORS requests (for uses like 
 TIKA-1585). See [this example|http://cxf.apache.org/docs/jax-rs-cors.html] 
 from CXF for how to add it.
 The only change from that site is that we will need to add a 
 {{CrossOriginResourceSharingFilter}} as a provider.
 Ideally, this is configurable (limit which resources have CORS, and which 
 origins are allowed). But, I'm not thinking of any general methods of how to 
 do that...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (TIKA-1586) Enable CORS on Tika Server

2015-03-28 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich resolved TIKA-1586.
---
Resolution: Fixed

Fixed in r1669799.

 Enable CORS on Tika Server
 --

 Key: TIKA-1586
 URL: https://issues.apache.org/jira/browse/TIKA-1586
 Project: Tika
  Issue Type: New Feature
  Components: server
Reporter: Tyler Palsulich
Assignee: Tyler Palsulich

 Tika Server should allow configuration of CORS requests (for uses like 
 TIKA-1585). See [this example|http://cxf.apache.org/docs/jax-rs-cors.html] 
 from CXF for how to add it.
 The only change from that site is that we will need to add a 
 {{CrossOriginResourceSharingFilter}} as a provider.
 Ideally, this is configurable (limit which resources have CORS, and which 
 origins are allowed). But, I'm not thinking of any general methods of how to 
 do that...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1354) ForkParser doesn't work in OSGI container

2015-03-28 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385403#comment-14385403
 ] 

Chris A. Mattmann commented on TIKA-1354:
-

thanks!

 ForkParser doesn't work in OSGI container
 -

 Key: TIKA-1354
 URL: https://issues.apache.org/jira/browse/TIKA-1354
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.6
Reporter: Michal Hlavac
 Fix For: 1.7


 I can't find way to run ForkParser in OSGI container.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1577) NetCDF Data Extraction

2015-03-28 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385408#comment-14385408
 ] 

Chris A. Mattmann commented on TIKA-1577:
-

Agreed, if we can reuse this, then great. The one catch is that I'm not sure 
that the dump capability generates a table or something similar in an XHTML 
representation, which is our base representation in Tika. I would like us to 
consider the output of this issue to be:

- TikaParser generates XHTML tabular and other elements that represent the data 
in the NetCDF file
- we create something like a ScientificContentHandler that can then take that 
output from the parser (in the data section) and format it, e.g., like NCDump. 

Sound good?
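
For the first bullet, a minimal sketch of emitting tabular data through Tika's 
XHTMLContentHandler (the writeDataTable helper and its arguments are invented 
for illustration; how values are actually pulled from the NetCDF data section 
is left open):
{code}
import org.apache.tika.metadata.Metadata;
import org.apache.tika.sax.XHTMLContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class NetCDFDataTableSketch {

    // One table row per variable: name in the first cell, a rendered value in the second.
    static void writeDataTable(ContentHandler handler, Metadata metadata,
                               String[] names, String[] values) throws SAXException {
        XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
        xhtml.startDocument();
        xhtml.startElement("table");
        for (int i = 0; i < names.length; i++) {
            xhtml.startElement("tr");
            xhtml.element("td", names[i]);
            xhtml.element("td", values[i]);
            xhtml.endElement("tr");
        }
        xhtml.endElement("table");
        xhtml.endDocument();
    }
}
{code}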

 NetCDF Data Extraction
 --

 Key: TIKA-1577
 URL: https://issues.apache.org/jira/browse/TIKA-1577
 Project: Tika
  Issue Type: Improvement
  Components: handler, parser
Affects Versions: 1.7
Reporter: Ann Burgess
Assignee: Ann Burgess
  Labels: features, handler
 Fix For: 1.8

   Original Estimate: 504h
  Remaining Estimate: 504h

 A netCDF classic or 64-bit offset dataset is stored as a single file 
 comprising two parts:
  - a header, containing all the information about dimensions, attributes, and 
 variables except for the variable data;
  - a data part, comprising fixed-size data, containing the data for variables 
 that don't have an unlimited dimension; and variable-size data, containing 
 the data for variables that have an unlimited dimension.
 The NetCDFparser currently extracts the header part.  
  -- text extracts file Dimensions and Variables
  -- metadata extracts Global Attributes
 We want the option to extract the data part of NetCDF files.  
 Lets use the NetCDF test file for our dev testing:  
 tika/tika-parsers/src/test/resources/test-documents/sresa1b_ncar_ccsm3_0_run1_21.nc
  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1586) Enable CORS on Tika Server

2015-03-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385407#comment-14385407
 ] 

Hudson commented on TIKA-1586:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #578 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/578/])
TIKA-1586. Enable CORS requests on Tika server.

This fixes #37. (tpalsulich: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1669799)
* /tika/trunk/CHANGES.txt
* /tika/trunk/tika-server/pom.xml
* 
/tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaServerCli.java


 Enable CORS on Tika Server
 --

 Key: TIKA-1586
 URL: https://issues.apache.org/jira/browse/TIKA-1586
 Project: Tika
  Issue Type: New Feature
  Components: server
Reporter: Tyler Palsulich
Assignee: Tyler Palsulich

 Tika Server should allow configuration of CORS requests (for uses like 
 TIKA-1585). See [this example|http://cxf.apache.org/docs/jax-rs-cors.html] 
 from CXF for how to add it.
 The only change from that site is that we will need to add a 
 {{CrossOriginResourceSharingFilter}} as a provider.
 Ideally, this is configurable (limit which resources have CORS, and which 
 origins are allowed). But, I'm not thinking of any general methods of how to 
 do that...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: trunk test failure

2015-03-28 Thread Mattmann, Chris A (3980)
Thanks Oleg!

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: Oleg Tikhonov olegtikho...@gmail.com
Reply-To: dev@tika.apache.org dev@tika.apache.org
Date: Thursday, March 26, 2015 at 12:19 AM
To: dev@tika.apache.org dev@tika.apache.org
Subject: Re: trunk test failure

Hi Chris,
just to confirm:

[INFO]

[INFO] Reactor Summary:
[INFO]
[INFO] Apache Tika parent . SUCCESS [
9.268 s]
[INFO] Apache Tika core ... SUCCESS [
25.823 s]
[INFO] Apache Tika parsers  SUCCESS [02:41
min]
[INFO] Apache Tika XMP  SUCCESS [
1.986 s]
[INFO] Apache Tika serialization .. SUCCESS [
1.604 s]
[INFO] Apache Tika batch .. SUCCESS [02:02
min]
[INFO] Apache Tika application  SUCCESS [
18.983 s]
[INFO] Apache Tika OSGi bundle  SUCCESS [
29.087 s]
[INFO] Apache Tika server . SUCCESS [
46.706 s]
[INFO] Apache Tika translate .. SUCCESS [
9.163 s]
[INFO] Apache Tika examples ... SUCCESS [
4.134 s]
[INFO] Apache Tika Java-7 Components .. SUCCESS [
1.236 s]
[INFO] Apache Tika  SUCCESS [
0.017 s]
[INFO]

[INFO] BUILD SUCCESS
[INFO]

[INFO] Total time: 07:20 min
[INFO] Finished at: 2015-03-26T09:18:46+02:00
[INFO] Final Memory: 91M/848M
[INFO]



BR,
OLeg

On Thu, Mar 26, 2015 at 1:21 AM, Mattmann, Chris A (3980) 
chris.a.mattm...@jpl.nasa.gov wrote:

 OK I am nuts - I was applying the patch from TIKA-1580, but didn’t
 update Felix in the bundle pom - done now, building again. Yay.


 ++
 Chris Mattmann, Ph.D.
 Chief Architect
 Instrument Software and Science Data Systems Section (398)
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 168-519, Mailstop: 168-527
 Email: chris.a.mattm...@nasa.gov
 WWW:  http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Associate Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++






 -Original Message-
 From: Mattmann, Chris Mattmann chris.a.mattm...@jpl.nasa.gov
 Reply-To: dev@tika.apache.org dev@tika.apache.org
 Date: Wednesday, March 25, 2015 at 6:57 PM
 To: dev@tika.apache.org dev@tika.apache.org
 Subject: trunk test failure

 Hey Anyone else seeing this failure in trunk?
 
 Running org.apache.tika.bundle.BundleIT
 [main] INFO org.ops4j.pax.exam.spi.DefaultExamSystem - Pax Exam System
 (Version: 4.4.0) created.
 [main] INFO org.ops4j.pax.exam.junit.impl.ProbeRunner - creating
PaxExam
 runner for class org.apache.tika.bundle.BundleIT
 [main] INFO org.ops4j.pax.exam.junit.impl.ProbeRunner - running test
class
 org.apache.tika.bundle.BundleIT
ERROR: Bundle org.apache.tika.bundle [17] Error starting
file:/Users/mattmann/tmp/tika/tika-bundle/target/test-bundles/tika-bundle.jar
(org.osgi.framework.BundleException: Unresolved constraint in bundle
org.apache.tika.bundle [17]: Unable to resolve 17.0: missing requirement
[17.0] osgi.wiring.package;
(&(osgi.wiring.package=org.apache.commons.csv)(version>=1.0.0)(!(version>=2.0.0))))
org.osgi.framework.BundleException: Unresolved constraint in bundle
org.apache.tika.bundle [17]: Unable to resolve 17.0: missing requirement
[17.0] osgi.wiring.package;
(&(osgi.wiring.package=org.apache.commons.csv)(version>=1.0.0)(!(version>=2.0.0)))
    at org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:4097)
    at org.apache.felix.framework.Felix.startBundle(Felix.java:2114)
    at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1368)
    at org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:308)
    at java.lang.Thread.run(Thread.java:745)
 [main] ERROR 

[jira] [Commented] (TIKA-1585) Create Example Website with Form Submission

2015-03-28 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385411#comment-14385411
 ] 

Tyler Palsulich commented on TIKA-1585:
---

CORS work is now integrated. [~talli...@mitre.org], can you restart the server 
on 162.242.228.174:9998 with the --cors http://tpalsulich.github.io option?

Then, we can close off the 9997 port (my github.io site is querying 9997, 
though, so I'll need to update that).

Is there an official place we'd like to host the above site?

 Create Example Website with Form Submission
 ---

 Key: TIKA-1585
 URL: https://issues.apache.org/jira/browse/TIKA-1585
 Project: Tika
  Issue Type: New Feature
  Components: example, server
Reporter: Tyler Palsulich
Assignee: Tyler Palsulich

 It would be great to have a website where we can direct people who ask what 
 Tika can do for [filetype] without needing them to actually download Tika.
 Some initial work to do that is 
 [here|http://tpalsulich.github.io/TikaExamples/].
 I'm far from a design guru, but I imagine the site as having a form where you 
 can upload a file at the top, checkboxes for if you want metadata, content, 
 or both, and a submit button. The request should be sent with AJAX and the 
 result should populate a {{div}}.
 One issue with AJAX requests is that Tika Server doesn't currently allow 
 Cross-Origin-Resource-Sharing (CORS). So, we either need to maintain a 
 slightly updated tika-server, or update the server to allow configuration.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


RE: [DISCUSS] Tika 1.8 or 1.7.1

2015-03-28 Thread Ken Krugler
Given how recently we did a 1.7 release, my vote would be for 1.7.1

And to keep this release as simple as possible, just cherry-pick the fix for 
TIKA-1581 into the 1.7 code base.

-- Ken

 From: Tyler Palsulich
 Sent: March 28, 2015 8:01:03am PDT
 To: dev@tika.apache.org
 Subject: [DISCUSS] Tika 1.8 or 1.7.1
 
 Hi Folks,
 
 Now that TIKA-1581 (JHighlight licensing issues) is resolved, we need to
 release a new version of Tika. I'll volunteer to be the release manager
 again.
 
 Should we release this as 1.8 or 1.7.1?
 
 Does anyone have any last minute issues they'd like to finish and see in
 Tika 1.X? I'd like to get the example working with CORS (TIKA-1585 and
 TIKA-1586). Any others?
 
 Have a good weekend,
 Tyler

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr







[jira] [Commented] (TIKA-1579) Add file type to NetCDFParser

2015-03-28 Thread Ann Burgess (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385383#comment-14385383
 ] 

Ann Burgess commented on TIKA-1579:
---

Yes!

On Sat, Mar 28, 2015 at 6:09 AM, Tyler Palsulich (JIRA) j...@apache.org




-- 
--
Ann Bryant Burgess, PhD

Postdoctoral Fellow
Computer Science Department
Viterbi School of Engineering
University of Southern California

Phone:  (585) 738-7549
--


 Add file type to NetCDFParser
 -

 Key: TIKA-1579
 URL: https://issues.apache.org/jira/browse/TIKA-1579
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Ann Burgess
Assignee: Ann Burgess
 Attachments: TIKA-1579.abburgess.190315.patch.txt


 [~gostep] explains that there are three versions of NetCDF (classic format, 
 64-bit offset, and netCDF-4/HDF5 format). When opening an existing netCDF 
 file, the netCDF library will transparently detect its format, so we do not 
 need to adjust according to the detected format.
 That said, it would be good to know the file type, as each can have the .nc 
 extension.  This patch will add the file type to the metadata.
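
 A minimal sketch of that kind of check, based on the published NetCDF magic 
 numbers (the netcdf:file-type property name is made up here for illustration 
 and is not necessarily the key the attached patch uses):
 {code}
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;

public class NetCDFVariantSketch {

    // Classic files start with "CDF\x01", 64-bit offset with "CDF\x02",
    // and netCDF-4 files are HDF5 containers starting with 0x89 "HDF".
    static void recordVariant(InputStream stream, Metadata metadata) throws IOException {
        byte[] head = new byte[4];
        if (stream.read(head) < 4) {
            return; // too short to tell (a real detector would read fully and reset the stream)
        }
        if (head[0] == 'C' && head[1] == 'D' && head[2] == 'F') {
            if (head[3] == 0x01) {
                metadata.set("netcdf:file-type", "classic");
            } else if (head[3] == 0x02) {
                metadata.set("netcdf:file-type", "64-bit offset");
            }
        } else if ((head[0] & 0xFF) == 0x89 && head[1] == 'H' && head[2] == 'D' && head[3] == 'F') {
            metadata.set("netcdf:file-type", "netCDF-4/HDF5");
        }
    }
}
 {code}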



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[DISCUSS] Tika 1.8 or 1.7.1

2015-03-28 Thread Tyler Palsulich
Hi Folks,

Now that TIKA-1581 (JHighlight licensing issues) is resolved, we need to
release a new version of Tika. I'll volunteer to be the release manager
again.

Should we release this as 1.8 or 1.7.1?

Does anyone have any last minute issues they'd like to finish and see in
Tika 1.X? I'd like to get the example working with CORS (TIKA-1585 and
TIKA-1586). Any others?

Have a good weekend,
Tyler


[GitHub] tika pull request: TIKA-1586. Enable CORS requests on Tika server

2015-03-28 Thread tpalsulich
GitHub user tpalsulich opened a pull request:

https://github.com/apache/tika/pull/37

TIKA-1586. Enable CORS requests on Tika server



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tpalsulich/tika TIKA-1586

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/37.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #37


commit c296a0459477f1ba088d96fa6ba3895e3a6b3ac5
Author: Tyler Palsulich tpalsul...@gmail.com
Date:   2015-03-28T15:45:45Z

TIKA-1586. Enable CORS requests on Tika server.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


Re: [DISCUSS] Tika 1.8 or 1.7.1

2015-03-28 Thread Mattmann, Chris A (3980)
Hi Tyler - I would VOTE for 1.8. Given the stuff associated
with releasing (updating the website; sending emails; waiting
periods, etc.) let’s ship all the updates we have too along
with the jhighlight fix.

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: Tyler Palsulich tpalsul...@apache.org
Reply-To: dev@tika.apache.org dev@tika.apache.org
Date: Saturday, March 28, 2015 at 8:01 AM
To: dev@tika.apache.org dev@tika.apache.org
Subject: [DISCUSS] Tika 1.8 or 1.7.1

Hi Folks,

Now that TIKA-1581 (JHighlight licensing issues) is resolved, we need to
release a new version of Tika. I'll volunteer to be the release manager
again.

Should we release this as 1.8 or 1.7.1?

Does anyone have any last minute issues they'd like to finish and see in
Tika 1.X? I'd like to get the example working with CORS (TIKA-1585 and
TIKA-1586). Any others?

Have a good weekend,
Tyler



[jira] [Commented] (TIKA-1586) Enable CORS on Tika Server

2015-03-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385369#comment-14385369
 ] 

ASF GitHub Bot commented on TIKA-1586:
--

GitHub user tpalsulich opened a pull request:

https://github.com/apache/tika/pull/37

TIKA-1586. Enable CORS requests on Tika server



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tpalsulich/tika TIKA-1586

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/37.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #37


commit c296a0459477f1ba088d96fa6ba3895e3a6b3ac5
Author: Tyler Palsulich tpalsul...@gmail.com
Date:   2015-03-28T15:45:45Z

TIKA-1586. Enable CORS requests on Tika server.




 Enable CORS on Tika Server
 --

 Key: TIKA-1586
 URL: https://issues.apache.org/jira/browse/TIKA-1586
 Project: Tika
  Issue Type: New Feature
  Components: server
Reporter: Tyler Palsulich
Assignee: Tyler Palsulich

 Tika Server should allow configuration of CORS requests (for uses like 
 TIKA-1585). See [this example|http://cxf.apache.org/docs/jax-rs-cors.html] 
 from CXF for how to add it.
 The only change from that site is that we will need to add a 
 {{CrossOriginResourceSharingFilter}} as a provider.
 Ideally, this is configurable (limit which resources have CORS, and which 
 origins are allowed). But, I'm not thinking of any general methods of how to 
 do that...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1586) Enable CORS on Tika Server

2015-03-28 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385372#comment-14385372
 ] 

Tyler Palsulich commented on TIKA-1586:
---

Can someone take a look at the above PR and make sure I'm not doing anything 
bone-headed? Thanks!

 Enable CORS on Tika Server
 --

 Key: TIKA-1586
 URL: https://issues.apache.org/jira/browse/TIKA-1586
 Project: Tika
  Issue Type: New Feature
  Components: server
Reporter: Tyler Palsulich
Assignee: Tyler Palsulich

 Tika Server should allow configuration of CORS requests (for uses like 
 TIKA-1585). See [this example|http://cxf.apache.org/docs/jax-rs-cors.html] 
 from CXF for how to add it.
 The only change from that site is that we will need to add a 
 {{CrossOriginResourceSharingFilter}} as a provider.
 Ideally, this is configurable (limit which resources have CORS, and which 
 origins are allowed). But, I'm not thinking of any general methods of how to 
 do that...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)

2015-03-28 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385440#comment-14385440
 ] 

Tim Allison commented on TIKA-1584:
---

Able to attach an example triggering doc?  By "same behavior" do you mean that a 
docx inside a zip is not extracted with -X?

 Tika 1.7 possible regression (nested attachment files not getting parsed)
 -

 Key: TIKA-1584
 URL: https://issues.apache.org/jira/browse/TIKA-1584
 Project: Tika
  Issue Type: Bug
  Components: server
Affects Versions: 1.7
Reporter: Rob Tulloh

 I tried to send this to the tika user list, but got a qmail failure so I am 
 opening a jira to see if I can get help with this.
 There appears to be a change in the behavior of tika since 1.5 (the last 
 version we have used). In 1.5, if we pass a file with content type of rfc822 
 which contains a zip that contains a docx file, the entire content would get 
 recursed and the text returned. In 1.7, tika only unwinds as far as the zip 
 file and ignores the content of the contained docx file. This is causing a 
 regression failure in our search tests because the contents of the docx file 
 are not found when searched for.
  
 We are testing with tika-server if this helps. If we ask the meta service to 
 just characterize the test data, it correctly determines the input is of type 
 rfc822. However, on extract, the contents of the attachment are not extracted 
 as expected.
 curl -X PUT -T test.eml -q -H "Content-Type:application/octet-stream" 
 http://localhost:9998/meta 2>/dev/null | grep Content-Type
 Content-Type,message/rfc822
 curl -X PUT -T test.eml -q -H "Content-Type:application/octet-stream" 
 http://localhost:9998/tika 2>/dev/null | grep docx
 sign.docx   <--- this is not expected, need contents of this extracted
 We can easily reproduce this problem with a simple eml file with an 
 attachment. Can someone please comment if this seems like a problem or 
 perhaps we need to change something in our call to get the old behavior?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)

2015-03-28 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385440#comment-14385440
 ] 

Tim Allison edited comment on TIKA-1584 at 3/28/15 6:08 PM:


Able to attach an example triggering doc?  By "same behavior" do you mean that a 
docx inside a zip is not extracted with /tika? Any luck with /rmeta?


was (Author: talli...@mitre.org):
Able to attach an example triggering doc?  By "same behavior" do you mean that a 
docx inside a zip is not extracted with -X?

 Tika 1.7 possible regression (nested attachment files not getting parsed)
 -

 Key: TIKA-1584
 URL: https://issues.apache.org/jira/browse/TIKA-1584
 Project: Tika
  Issue Type: Bug
  Components: server
Affects Versions: 1.7
Reporter: Rob Tulloh

 I tried to send this to the tika user list, but got a qmail failure so I am 
 opening a jira to see if I can get help with this.
 There appears to be a change in the behavior of tika since 1.5 (the last 
 version we have used). In 1.5, if we pass a file with content type of rfc822 
 which contains a zip that contains a docx file, the entire content would get 
 recursed and the text returned. In 1.7, tika only unwinds as far as the zip 
 file and ignores the content of the contained docx file. This is causing a 
 regression failure in our search tests because the contents of the docx file 
 are not found when searched for.
  
 We are testing with tika-server if this helps. If we ask the meta service to 
 just characterize the test data, it correctly determines the input is of type 
 rfc822. However, on extract, the contents of the attachment are not extracted 
 as expected.
 curl -X PUT -T test.eml -q -H "Content-Type:application/octet-stream" 
 http://localhost:9998/meta 2>/dev/null | grep Content-Type
 Content-Type,message/rfc822
 curl -X PUT -T test.eml -q -H "Content-Type:application/octet-stream" 
 http://localhost:9998/tika 2>/dev/null | grep docx
 sign.docx   <--- this is not expected, need contents of this extracted
 We can easily reproduce this problem with a simple eml file with an 
 attachment. Can someone please comment if this seems like a problem or 
 perhaps we need to change something in our call to get the old behavior?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-1582) Mime Detection based on neural networks with Byte-frequency-histogram

2015-03-28 Thread Luke sh (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke sh updated TIKA-1582:
--
Attachment: nnmodel.docx

Documentation 

 Mime Detection based on neural networks with Byte-frequency-histogram 
 --

 Key: TIKA-1582
 URL: https://issues.apache.org/jira/browse/TIKA-1582
 Project: Tika
  Issue Type: Improvement
  Components: detector, mime
Affects Versions: 1.7
Reporter: Luke sh
Priority: Trivial
 Attachments: nnmodel.docx


 Content-based mime type detection is one of the popular approaches to detect 
 mime type, there are others based on file extension and magic numbers ; And 
 currently Tika has implemented 3 approaches in detecting mime types; 
 They are :
 1) file extensions
 2) magic numbers (the most trustworthy in tika)
 3) content-type(the header in the http response if present and available) 
 Content-based mime type detection, however, analyses the distribution of the 
 entire stream of bytes, finds a similar pattern for the same type, and builds 
 a function that is able to group them into one or several classes so as to 
 classify and predict. It is believed this feature might broaden the usage of 
 Tika with a bit more security enforcement for mime type detection. Because we 
 want to build a model that is etched with the patterns it has seen, in some 
 situations we may not trust those types which have not been trained/learned 
 by the model. In some situations, magic numbers embedded in the files can be 
 copied but the actual content could be a potentially detrimental Trojan 
 program. By enforcing the trust on byte frequency patterns, we are able to 
 enhance the security of the detection.
 The proposed content-based mime detection to be integrated into Tika is based 
 on a machine learning algorithm, i.e. a neural network with back-propagation. 
 The input is 256 bins (one per byte value 0-255), each of which stores the 
 count of occurrences of that byte; the byte frequency histograms are 
 normalized to fall in the range between 0 and 1 and are then passed to a 
 companding function to enhance the infrequent bytes.
 The output of the neural network is a binary decision, 1 or 0.
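
 A minimal sketch of that input-vector construction (normalizing by the largest 
 bin and using a square-root companding function are assumptions made here for 
 illustration, not necessarily the choices in the attached patch):
 {code}
import java.io.IOException;
import java.io.InputStream;

public class ByteHistogramSketch {

    // 256 bins of byte counts, scaled into [0, 1], then companded to boost rare bytes.
    static double[] featureVector(InputStream in) throws IOException {
        long[] counts = new long[256];
        int b;
        while ((b = in.read()) != -1) {
            counts[b]++;
        }
        long max = 0;
        for (long c : counts) {
            max = Math.max(max, c);
        }
        double[] features = new double[256];
        if (max == 0) {
            return features; // empty stream
        }
        for (int i = 0; i < 256; i++) {
            double normalized = (double) counts[i] / max; // in [0, 1]
            features[i] = Math.sqrt(normalized);          // example companding function
        }
        return features;
    }
}
 {code}
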
 Notice BTW, the proposed feature will be implemented with GRB file type as 
 one example.
 In this example, we build a model that is able to distinguish the GRB file 
 type from non-GRB file types; note that the set of non-GRB files is huge and 
 cannot be easily defined, so there need to be as many negative training 
 examples as possible to form the non-GRB decision boundary.
 The neural network is considered a two-stage process: training and 
 classification.
 The training can be done in any programming language; in this feature / 
 research, the training of the neural network is implemented in R and the 
 source can be found in my github repository, i.e. 
 https://github.com/LukeLiush/filetypeDetection; I am also going to post a 
 document that describes the use of the program and the syntax/format of the 
 input and output.
 After training, we need to export the model and import it into Tika; in Tika, 
 we create a TrainedModelDetector that reads one or more model files with 
 their model parameters, so it can detect the mime types covered by those 
 models. Details of the research and usage of this proposed feature will be 
 posted on my github shortly.
 It is worth noting again that in this research we only worked out one model - 
 GRB - as one example to demonstrate the use of this content-based mime 
 detection. One of the challenges is that the non-GRB file types cannot be 
 clearly defined unless we feed our model example data for all of the existing 
 file types in the world, which seems too utopian, so it is better that the 
 set of classes/types is given and defined in advance to minimize the problem 
 domain. 
 Another challenge is the size of the training data; even if we know the types 
 we want to classify, getting enough training data to form a model is also one 
 of the main factors of success. In our example model, GRB data are collected 
 from ftp://hydro1.sci.gsfc.nasa.gov/data/, and we find that the GRB data from 
 that source all exhibit a similar pattern; a simple neural network structure 
 is able to predict well, and even a linear logistic regression does a good 
 job. However, if we pass GRB files collected from other sources to the model 
 for prediction, the model predicts poorly and unexpectedly. This brings up 
 the question of whether we need to include all training data or only the data 
 of interest; including all data is very expensive, so it is necessary to 
 introduce some domain knowledge 
 to 

[jira] [Updated] (TIKA-1582) Mime Detection based on neural networks with Byte-frequency-histogram

2015-03-28 Thread Luke sh (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke sh updated TIKA-1582:
--
Attachment: week2-report-histogram comparison.docx

histogram comparison

 Mime Detection based on neural networks with Byte-frequency-histogram 
 --

 Key: TIKA-1582
 URL: https://issues.apache.org/jira/browse/TIKA-1582
 Project: Tika
  Issue Type: Improvement
  Components: detector, mime
Affects Versions: 1.7
Reporter: Luke sh
Priority: Trivial
 Attachments: nnmodel.docx, week2-report-histogram comparison.docx, 
 week6 report.docx


 Content-based mime type detection is one of the popular approaches to detect 
 mime type, there are others based on file extension and magic numbers ; And 
 currently Tika has implemented 3 approaches in detecting mime types; 
 They are :
 1) file extensions
 2) magic numbers (the most trustworthy in tika)
 3) content-type(the header in the http response if present and available) 
 Content-based mime type detection, however, analyses the distribution of the 
 entire stream of bytes, finds a similar pattern for the same type, and builds 
 a function that is able to group them into one or several classes so as to 
 classify and predict. It is believed this feature might broaden the usage of 
 Tika with a bit more security enforcement for mime type detection. Because we 
 want to build a model that is etched with the patterns it has seen, in some 
 situations we may not trust those types which have not been trained/learned 
 by the model. In some situations, magic numbers embedded in the files can be 
 copied but the actual content could be a potentially detrimental Trojan 
 program. By enforcing the trust on byte frequency patterns, we are able to 
 enhance the security of the detection.
 The proposed content-based mime detection to be integrated into Tika is based 
 on a machine learning algorithm, i.e. a neural network with back-propagation. 
 The input is 256 bins (one per byte value 0-255), each of which stores the 
 count of occurrences of that byte; the byte frequency histograms are 
 normalized to fall in the range between 0 and 1 and are then passed to a 
 companding function to enhance the infrequent bytes.
 The output of the neural network is a binary decision, 1 or 0.
 Notice BTW, the proposed feature will be implemented with GRB file type as 
 one example.
 In this example, we build a model that is able to distinguish the GRB file 
 type from non-GRB file types; note that the set of non-GRB files is huge and 
 cannot be easily defined, so there need to be as many negative training 
 examples as possible to form the non-GRB decision boundary.
 The neural network is considered a two-stage process: training and 
 classification.
 The training can be done in any programming language, in this feature 
 /research, the training of neural network is implemented in R and the source 
 can be found in my github repository i.e. 
 https://github.com/LukeLiush/filetypeDetection; i am also going to post a 
 document that describe the use of the program, the syntax/ format of the 
 input and output.
 After training, we need to export the model and import it to Tika; in Tika, 
 we create a TrainedModelDetector that reads this model file with one or more 
 model parameters or several model files,so it can detect the mime types with 
 the model of those mime types. Details of the research and usage with this 
 proposed feature will be posted on my github shortly.
 It is worth noting again that in this research we only worked out one model - 
 GRB as one example to demonstrate the use of this content-based mime 
 detection. One of the challenges again is that the non-GRB file types cannot 
 be clearly defined unless we feed our model with some example data for all of 
 the existing file types in the world, but this seems to be too utopian and a 
 bit less likely, so it is better that the set of class/types is given and 
 defined in advance to minimize the problem domain. 
 Another challenge is the size of the training data; even if we know the types 
 we want to classify, getting enough training data to form a model can be also 
 one of the main factors of success. In our example model, grb data are 
 collected from ftp://hydro1.sci.gsfc.nasa.gov/data/; and we find out that the 
 grb data from that source all exhibit a similar pattern, a simple neural 
 network structure is able to predict well, even a linear logistic regression 
 is able to do a good job; However, if we pass the GRB files collected from 
 other source to the model for prediction, then we find out that the model 
 predict poorly and unexpectedly, so this bring up the aspect of whether we 
 need to include all training data or those are of interest, 

[jira] [Updated] (TIKA-1582) Mime Detection based on neural networks with Byte-frequency-histogram

2015-03-28 Thread Luke sh (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke sh updated TIKA-1582:
--
Attachment: week6 report.docx

Test report 

 Mime Detection based on neural networks with Byte-frequency-histogram 
 --

 Key: TIKA-1582
 URL: https://issues.apache.org/jira/browse/TIKA-1582
 Project: Tika
  Issue Type: Improvement
  Components: detector, mime
Affects Versions: 1.7
Reporter: Luke sh
Priority: Trivial
 Attachments: nnmodel.docx, week6 report.docx


 Content-based mime type detection is one of the popular approaches to detect 
 mime type, there are others based on file extension and magic numbers ; And 
 currently Tika has implemented 3 approaches in detecting mime types; 
 They are :
 1) file extensions
 2) magic numbers (the most trustworthy in tika)
 3) content-type(the header in the http response if present and available) 
 Content-based mime type detection, however, analyses the distribution of the 
 entire stream of bytes, finds a similar pattern for the same type, and builds 
 a function that is able to group them into one or several classes so as to 
 classify and predict. It is believed this feature might broaden the usage of 
 Tika with a bit more security enforcement for mime type detection. Because we 
 want to build a model that is etched with the patterns it has seen, in some 
 situations we may not trust those types which have not been trained/learned 
 by the model. In some situations, magic numbers embedded in the files can be 
 copied but the actual content could be a potentially detrimental Trojan 
 program. By enforcing the trust on byte frequency patterns, we are able to 
 enhance the security of the detection.
 The proposed content-based mime detection to be integrated into Tika is based 
 on a machine learning algorithm, i.e. a neural network with back-propagation. 
 The input is 256 bins (one per byte value 0-255), each of which stores the 
 count of occurrences of that byte; the byte frequency histograms are 
 normalized to fall in the range between 0 and 1 and are then passed to a 
 companding function to enhance the infrequent bytes.
 The output of the neural network is a binary decision, 1 or 0.
 Notice BTW, the proposed feature will be implemented with GRB file type as 
 one example.
 In this example, we build a model that is able to distinguish the GRB file 
 type from non-GRB file types; note that the set of non-GRB files is huge and 
 cannot be easily defined, so there need to be as many negative training 
 examples as possible to form the non-GRB decision boundary.
 The neural network is considered a two-stage process: training and 
 classification.
 The training can be done in any programming language, in this feature 
 /research, the training of neural network is implemented in R and the source 
 can be found in my github repository i.e. 
 https://github.com/LukeLiush/filetypeDetection; i am also going to post a 
 document that describe the use of the program, the syntax/ format of the 
 input and output.
 After training, we need to export the model and import it to Tika; in Tika, 
 we create a TrainedModelDetector that reads this model file with one or more 
 model parameters or several model files,so it can detect the mime types with 
 the model of those mime types. Details of the research and usage with this 
 proposed feature will be posted on my github shortly.
 It is worth noting again that in this research we only worked out one model - 
 GRB as one example to demonstrate the use of this content-based mime 
 detection. One of the challenges again is that the non-GRB file types cannot 
 be clearly defined unless we feed our model with some example data for all of 
 the existing file types in the world, but this seems to be too utopian and a 
 bit less likely, so it is better that the set of class/types is given and 
 defined in advance to minimize the problem domain. 
 Another challenge is the size of the training data; even if we know the types 
 we want to classify, getting enough training data to form a model can be also 
 one of the main factors of success. In our example model, grb data are 
 collected from ftp://hydro1.sci.gsfc.nasa.gov/data/; and we find out that the 
 grb data from that source all exhibit a similar pattern, a simple neural 
 network structure is able to predict well, even a linear logistic regression 
 is able to do a good job; However, if we pass the GRB files collected from 
 other source to the model for prediction, then we find out that the model 
 predict poorly and unexpectedly, so this bring up the aspect of whether we 
 need to include all training data or those are of interest, including all 
 data is very expensive so it is necessary to introduce some 

[GitHub] tika pull request: Nn branch

2015-03-28 Thread LukeLiush
GitHub user LukeLiush opened a pull request:

https://github.com/apache/tika/pull/36

Nn branch

https://issues.apache.org/jira/browse/TIKA-1582


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/LukeLiush/tika nnBranch

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/36.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #36


commit eb04f13260bfb5e4f4b0bf7fd54ecd085995cb92
Author: LukeLiush hanson311...@gmail.com
Date:   2015-03-28T07:12:06Z

https://issues.apache.org/jira/browse/TIKA-1582

commit acaf27bb666fdef05bdb18d7edcaafe7ccfd9bf5
Author: LukeLiush hanson311...@gmail.com
Date:   2015-03-28T07:16:07Z

move the comments of apache licence to the top

commit 701fcc394ed2110e4c771fbb84999dca77932392
Author: LukeLiush hanson311...@gmail.com
Date:   2015-03-28T07:19:43Z

add some comments

commit 12f290826a88cd99bbf2e1a0385b315e73e3
Author: LukeLiush hanson311...@gmail.com
Date:   2015-03-28T07:25:55Z

move the example model file to the test resource directory

commit 6c8d2e523c427380438f24d90985e28bfdbce050
Author: LukeLiush hanson311...@gmail.com
Date:   2015-03-28T07:28:25Z

remove empty comment block




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (TIKA-1582) Mime Detection based on neural networks with Byte-frequency-histogram

2015-03-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385177#comment-14385177
 ] 

ASF GitHub Bot commented on TIKA-1582:
--

GitHub user LukeLiush opened a pull request:

https://github.com/apache/tika/pull/36

Nn branch

https://issues.apache.org/jira/browse/TIKA-1582


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/LukeLiush/tika nnBranch

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/36.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #36


commit eb04f13260bfb5e4f4b0bf7fd54ecd085995cb92
Author: LukeLiush hanson311...@gmail.com
Date:   2015-03-28T07:12:06Z

https://issues.apache.org/jira/browse/TIKA-1582

commit acaf27bb666fdef05bdb18d7edcaafe7ccfd9bf5
Author: LukeLiush hanson311...@gmail.com
Date:   2015-03-28T07:16:07Z

move the comments of apache licence to the top

commit 701fcc394ed2110e4c771fbb84999dca77932392
Author: LukeLiush hanson311...@gmail.com
Date:   2015-03-28T07:19:43Z

add some comments

commit 12f290826a88cd99bbf2e1a0385b315e73e3
Author: LukeLiush hanson311...@gmail.com
Date:   2015-03-28T07:25:55Z

move the example model file to the test resource directory

commit 6c8d2e523c427380438f24d90985e28bfdbce050
Author: LukeLiush hanson311...@gmail.com
Date:   2015-03-28T07:28:25Z

remove empty comment block




 Mime Detection based on neural networks with Byte-frequency-histogram 
 --

 Key: TIKA-1582
 URL: https://issues.apache.org/jira/browse/TIKA-1582
 Project: Tika
  Issue Type: Improvement
  Components: detector, mime
Affects Versions: 1.7
Reporter: Luke sh
Priority: Trivial

 Content-based mime type detection is one of the popular approaches to detect 
 mime type, there are others based on file extension and magic numbers ; And 
 currently Tika has implemented 3 approaches in detecting mime types; 
 They are :
 1) file extensions
 2) magic numbers (the most trustworthy in tika)
 3) content-type(the header in the http response if present and available) 
 Content-based mime type detection, however, analyses the distribution of the 
 entire stream of bytes, finds a similar pattern for the same type, and builds 
 a function that is able to group them into one or several classes so as to 
 classify and predict. It is believed this feature might broaden the usage of 
 Tika with a bit more security enforcement for mime type detection. Because we 
 want to build a model that is etched with the patterns it has seen, in some 
 situations we may not trust those types which have not been trained/learned 
 by the model. In some situations, magic numbers embedded in the files can be 
 copied but the actual content could be a potentially detrimental Trojan 
 program. By enforcing the trust on byte frequency patterns, we are able to 
 enhance the security of the detection.
 The proposed content-based mime detection to be integrated into Tika is based 
 on a machine learning algorithm, i.e. a neural network with back-propagation. 
 The input is 256 bins (one per byte value 0-255), each of which stores the 
 count of occurrences of that byte; the byte frequency histograms are 
 normalized to fall in the range between 0 and 1 and are then passed to a 
 companding function to enhance the infrequent bytes.
 The output of the neural network is a binary decision, 1 or 0.
 Notice BTW, the proposed feature will be implemented with GRB file type as 
 one example.
 In this example, we build a model that is able to distinguish the GRB file 
 type from non-GRB file types; note that the set of non-GRB files is huge and 
 cannot be easily defined, so there need to be as many negative training 
 examples as possible to form the non-GRB decision boundary.
 The neural network is considered a two-stage process: training and 
 classification.
 The training can be done in any programming language, in this feature 
 /research, the training of neural network is implemented in R and the source 
 can be found in my github repository i.e. 
 https://github.com/LukeLiush/filetypeDetection; i am also going to post a 
 document that describe the use of the program, the syntax/ format of the 
 input and output.
 After training, we need to export the model and import it to Tika; in Tika, 
 we create a TrainedModelDetector that reads this model file with one or more 
 model parameters or several model files,so it can detect the mime types with 
 the model of those mime types. Details of the research and usage with this 
 proposed feature will be posted on my github shortly.