[jira] [Assigned] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)
[ https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison reassigned TIKA-1584:
- Assignee: Tim Allison

Tika 1.7 possible regression (nested attachment files not getting parsed)

Key: TIKA-1584
URL: https://issues.apache.org/jira/browse/TIKA-1584
Project: Tika
Issue Type: Bug
Components: server
Affects Versions: 1.7
Reporter: Rob Tulloh
Assignee: Tim Allison
Priority: Blocker

I tried to send this to the Tika user list, but got a qmail failure, so I am opening a JIRA to see if I can get help with this. There appears to be a change in Tika's behavior since 1.5 (the last version we have used). In 1.5, if we pass a file with content type rfc822 which contains a zip that contains a docx file, the entire content gets recursed and the text returned. In 1.7, Tika only unwinds as far as the zip file and ignores the content of the contained docx file. This is causing a regression failure in our search tests because the contents of the docx file are not found when searched for.

We are testing with tika-server, if this helps. If we ask the meta service to just characterize the test data, it correctly determines the input is of type rfc822. However, on extract, the contents of the attachment are not extracted as expected.

curl -X PUT -T test.eml -q -H Content-Type:application/octet-stream http://localhost:9998/meta 2>/dev/null | grep Content-Type
Content-Type,message/rfc822

curl -X PUT -T test.eml -q -H Content-Type:application/octet-stream http://localhost:9998/tika 2>/dev/null | grep docx
sign.docx   <--- this is not expected, need contents of this extracted

We can easily reproduce this problem with a simple eml file with an attachment. Can someone please comment on whether this seems like a problem, or whether perhaps we need to change something in our call to get the old behavior?

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Review Request 32291: ISATab parsers (preliminary version)
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/32291/#review78159 ---

Ship it!

I have run this in production and it works awesome! Thanks Giuseppe!

- Chris Mattmann

On March 23, 2015, 5:04 p.m., Giuseppe Totaro wrote:

--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/32291/ ---

(Updated March 23, 2015, 5:04 p.m.)

Review request for tika and Chris Mattmann.

Bugs: TIKA-1580
https://issues.apache.org/jira/browse/TIKA-1580

Repository: tika

Description
---
ISATab parsers. This preliminary solution provides three parsers, one for each ISA-Tab filetype (Investigation, Study, Assay).

Diffs
-
trunk/tika-bundle/pom.xml 1668683
trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml 1668683
trunk/tika-parsers/pom.xml 1668683
trunk/tika-parsers/src/main/java/org/apache/tika/parser/isatab/ISATabAssayParser.java PRE-CREATION
trunk/tika-parsers/src/main/java/org/apache/tika/parser/isatab/ISATabInvestigationParser.java PRE-CREATION
trunk/tika-parsers/src/main/java/org/apache/tika/parser/isatab/ISATabStudyParser.java PRE-CREATION
trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser 1668683
trunk/tika-parsers/src/test/java/org/apache/tika/parser/isatab/ISATabAssayParserTest.java PRE-CREATION
trunk/tika-parsers/src/test/java/org/apache/tika/parser/isatab/ISATabInvestigationParserTest.java PRE-CREATION
trunk/tika-parsers/src/test/java/org/apache/tika/parser/isatab/ISATabStudyParserTest.java PRE-CREATION
trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_bii-s-2_metabolite profiling_NMR spectroscopy.txt PRE-CREATION
trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_metabolome.txt PRE-CREATION
trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_microarray.txt PRE-CREATION
trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_proteome.txt PRE-CREATION
trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_transcriptome.txt PRE-CREATION
trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/i_investigation.txt PRE-CREATION
trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/s_BII-S-1.txt PRE-CREATION
trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/s_BII-S-2.txt PRE-CREATION

Diff: https://reviews.apache.org/r/32291/diff/

Testing
---
Tested on sample ISA-Tab files downloaded from http://www.isa-tools.org/format/examples/.

Thanks,
Giuseppe Totaro
Re: Review Request 32291: ISATab parsers (preliminary version)
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/32291/#review78158 ---

Ship it!

Ship It!

- Chris Mattmann
[jira] [Updated] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)
[ https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-1584:
- Priority: Blocker (was: Major)

Tika 1.7 possible regression (nested attachment files not getting parsed)

Key: TIKA-1584
URL: https://issues.apache.org/jira/browse/TIKA-1584
[jira] [Commented] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available
[ https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385561#comment-14385561 ]

Konstantin Gribov commented on TIKA-1575:

What about updating to the released PDFBox 1.8.9? Extracting from {{966679.pdf}} (PDFBOX-2261) hangs on trunk Tika with 1.8.8. And I see no difference when extracting {{10-814_Appendix B_v3.pdf}}, so the form extraction issue seems to be fixed. My question relates to the discussion about releasing Tika 1.8/1.7.1; see dev@ (https://mail-archives.apache.org/mod_mbox/tika-dev/201503.mbox/%3CCAM%3DrFA6vvFV3XqpvSNCrubrVHhVO%3Dq%2BighPMRUkmA9f-fKkSXA%40mail.gmail.com%3E)

Upgrade to PDFBox 1.8.9 when available

Key: TIKA-1575
URL: https://issues.apache.org/jira/browse/TIKA-1575
Project: Tika
Issue Type: Improvement
Reporter: Tim Allison
Priority: Minor
Attachments: 005937.pdf.json, 005937_1_8_9-SNAPSHOT.pdf.json, 10-814_Appendix B_v3.pdf, 524276_719128_diffs.zip, PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT.xlsx, PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT_reports.zip, PDFBox_1_8_8Vs1_8_9_20150316.zip, caught_ex_1_8_9.zip, content_diffs_20150316.xlsx, diffs_1_8_9_multithread_vs_single_thread.xlsx, reports_1_8_9_multithread_vs_single.zip

The PDFBox community is about to release 1.8.9. Let's use this issue to track discussions before the release and to track Tika's upgrade to PDFBox 1.8.9.
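For context, a PDFBox version bump of this kind is normally a one-line POM change. A sketch, assuming the dependency is declared directly in {{tika-parsers/pom.xml}} (the real POM may manage the version via a property or a parent POM instead):

```xml
<!-- Hypothetical fragment for tika-parsers/pom.xml: pin PDFBox to 1.8.9
     once released. The exact location and any version property in the
     actual Tika build may differ. -->
<dependency>
  <groupId>org.apache.pdfbox</groupId>
  <artifactId>pdfbox</artifactId>
  <version>1.8.9</version>
</dependency>
```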
[jira] [Commented] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)
[ https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385477#comment-14385477 ]

Tim Allison commented on TIKA-1584:

IMHO this is major enough for a fix asap. Whether that's 1.7.1 with just this fix or a full cut of trunk as 1.8 is up to all devs. Tika colleagues, what do you think?

Key: TIKA-1584
URL: https://issues.apache.org/jira/browse/TIKA-1584
Re: [DISCUSS] Tika 1.8 or 1.7.1
Also, I think we should resolve TIKA-1575 (upgrade to PDFBox 1.8.9), since PDFBox 1.8.8 hangs on some PDF forms.

--
Best regards,
Konstantin Gribov

On Sat, 28 Mar 2015 at 23:22, Konstantin Gribov gros...@gmail.com:

+1 to releasing 1.8.

--
Best regards,
Konstantin Gribov

On Sat, 28 Mar 2015, 22:25, Tyler Palsulich tpalsul...@apache.org:

I'm also leaning toward 1.8. Especially given the newly identified regression in TIKA-1584.

Tyler

On Mar 28, 2015 11:47 AM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote:

Hi Tyler - I would VOTE for 1.8. Given the stuff associated with releasing (updating the website; sending emails; waiting periods, etc.) let's ship all the updates we have too, along with the jhighlight fix.

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory, Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++

-----Original Message-----
From: Tyler Palsulich tpalsul...@apache.org
Reply-To: dev@tika.apache.org
Date: Saturday, March 28, 2015 at 8:01 AM
To: dev@tika.apache.org
Subject: [DISCUSS] Tika 1.8 or 1.7.1

Hi Folks,

Now that TIKA-1581 (JHighlight licensing issues) is resolved, we need to release a new version of Tika. I'll volunteer to be the release manager again. Should we release this as 1.8 or 1.7.1? Does anyone have any last minute issues they'd like to finish and see in Tika 1.X? I'd like to get the example working with CORS (TIKA-1585 and TIKA-1586). Any others?

Have a good weekend,
Tyler
Re: [DISCUSS] Tika 1.8 or 1.7.1
Once we fix TIKA-1584, I don't have a preference. I defer to Chris's experience (so I guess, +1 for 1.8) given the amount of work required. It'd be great if we could make sure we aren't bundling any PDFs in our tika-app jar, too. Many apologies if that's been fixed!
[jira] [Commented] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)
[ https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385494#comment-14385494 ]

Rob Tulloh commented on TIKA-1584:

I would vote for a release, as we have been waiting for TIKA-1371 and were hoping to upgrade to 1.7.

Key: TIKA-1584
URL: https://issues.apache.org/jira/browse/TIKA-1584
Re: [DISCUSS] Tika 1.8 or 1.7.1
I'm also leaning toward 1.8. Especially given the newly identified regression in TIKA-1584.

Tyler
[jira] [Commented] (TIKA-1580) ISA-Tab parsers
[ https://issues.apache.org/jira/browse/TIKA-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385541#comment-14385541 ]

Chris A. Mattmann commented on TIKA-1580:

Committed in r1669839. Thank you [~gostep], you did amazing work on this!

{noformat}
[chipotle:~/tmp/tika] mattmann% svn commit -m "Fix for TIKA-1580: Support IsaTab MIME identification and parsing. Thanks to Giuseppe Totaro for all the great work!"
Sending        CHANGES.txt
Sending        tika-bundle/pom.xml
Sending        tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
Sending        tika-parsers/pom.xml
Adding         tika-parsers/src/main/java/org/apache/tika/parser/isatab
Adding         tika-parsers/src/main/java/org/apache/tika/parser/isatab/ISATabUtils.java
Adding         tika-parsers/src/main/java/org/apache/tika/parser/isatab/ISArchiveParser.java
Sending        tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
Adding         tika-parsers/src/test/java/org/apache/tika/parser/isatab
Adding         tika-parsers/src/test/java/org/apache/tika/parser/isatab/ISArchiveParserTest.java
Adding         tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1
Adding         tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_bii-s-2_metabolite profiling_NMR spectroscopy.txt
Adding         tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_metabolome.txt
Adding         tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_microarray.txt
Adding         tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_proteome.txt
Adding         tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_transcriptome.txt
Adding         tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/i_investigation.txt
Adding         tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/s_BII-S-1.txt
Adding         tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/s_BII-S-2.txt
Transmitting file data
Committed revision 1669839.
[chipotle:~/tmp/tika] mattmann%
{noformat}

ISA-Tab parsers

Key: TIKA-1580
URL: https://issues.apache.org/jira/browse/TIKA-1580
Project: Tika
Issue Type: New Feature
Components: parser
Reporter: Giuseppe Totaro
Assignee: Chris A. Mattmann
Priority: Minor
Labels: new-parser
Fix For: 1.8
Attachments: TIKA-1580.Mattmann.Totaro.032515.patch.txt, TIKA-1580.patch, TIKA-1580.v02.patch, TIKA-1580.v03.2.Mattmann.Totaro.03262015.patch

We are going to add parsers for ISA-Tab data formats. ISA-Tab files are related to [ISA Tools|http://www.isa-tools.org/], which help to manage an increasingly diverse set of life science, environmental, and biomedical experiments that employ one or a combination of technologies. The ISA tools are built upon the _Investigation_, _Study_, and _Assay_ tabular formats. Therefore, the ISA-Tab data format includes three types of file: Investigation file ({{i_*.txt}}), Study file ({{s_*.txt}}), and Assay file ({{a_*.txt}}). These files are organized as a [top-down hierarchy|http://www.isa-tools.org/format/specification/]: an Investigation file includes one or more Study files; each Study file includes one or more Assay files. Essentially, the Investigation file contains high-level information about the related study, so it provides only metadata about the ISA-Tab files. More details on the file format specification are [available online|http://isatab.sourceforge.net/docs/ISA-TAB_release-candidate-1_v1.0_24nov08.pdf].

The patch in attachment provides a preliminary version of the ISA-Tab parsers (there are three parsers, one for each ISA-Tab filetype):
* {{ISATabInvestigationParser.java}}: parses Investigation files. It extracts only metadata.
* {{ISATabStudyParser.java}}: parses Study files.
* {{ISATabAssayParser.java}}: parses Assay files.

The most important improvements to make are:
* Combine these three parsers in order to parse an ISArchive.
* Provide a better mapping of both study and assay data onto XHTML.

Currently, {{ISATabStudyParser}} and {{ISATabAssayParser}} provide a naive mapping function relying on [Apache Commons CSV|https://commons.apache.org/proper/commons-csv/].

Thanks for supporting me on this work, [~chrismattmann].
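The naive Commons CSV mapping mentioned above can be sketched roughly as follows. This is not the actual parser code; `parseTable` is a hypothetical helper that simply reads a tab-delimited ISA-Tab study/assay table into rows of cell values, using the stock `CSVFormat.TDF` format:

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

public class IsaTabSketch {

    // Read a tab-delimited ISA-Tab table (study or assay file) into rows
    // of cell values; the first row is the column header row.
    static List<List<String>> parseTable(Reader reader) throws Exception {
        List<List<String>> rows = new ArrayList<>();
        try (CSVParser parser = CSVFormat.TDF.parse(reader)) {
            for (CSVRecord record : parser) {
                List<String> row = new ArrayList<>();
                for (String cell : record) {
                    row.add(cell);
                }
                rows.add(row);
            }
        }
        return rows;
    }

    public static void main(String[] args) throws Exception {
        // Tiny inline sample in the shape of a study table.
        String sample = "Source Name\tSample Name\ns1\tsample-1\n";
        for (List<String> row : parseTable(new StringReader(sample))) {
            System.out.println(row);
        }
    }
}
```

A real parser would then emit these rows as XHTML table elements through the ContentHandler rather than collecting them in memory.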
[jira] [Resolved] (TIKA-1580) ISA-Tab parsers
[ https://issues.apache.org/jira/browse/TIKA-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann resolved TIKA-1580.
- Resolution: Fixed

ISA-Tab parsers

Key: TIKA-1580
URL: https://issues.apache.org/jira/browse/TIKA-1580
[jira] [Comment Edited] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)
[ https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385463#comment-14385463 ]

Tim Allison edited comment on TIKA-1584 at 3/28/15 6:53 PM:

Just checked svn. That's a major regression added in 1.7 when we added specification of ParseContext. We need to add the Parser to the ParseContext to get recursive parsing. W/o use of ParseContext in call to parse, the parser used to work recursively. Will fix Monday unless someone beats me to it. Thank you for raising this. No need to attach test doc.

was (Author: talli...@mitre.org):

Just checked svn. That's a major regression added in 1.7 when we added specification of ParseContext. We need to add the Parser to the ParseContext to get recursive parsing. W/o use of ParseContext in call to parse, the parser works recursively. Will fix Monday unless someone beats me to it. Thank you for raising this. No need to attach test doc.

Key: TIKA-1584
URL: https://issues.apache.org/jira/browse/TIKA-1584
[jira] [Commented] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)
[ https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385463#comment-14385463 ]

Tim Allison commented on TIKA-1584:

Just checked svn. That's a major regression added in 1.7 when we added specification of ParseContext. We need to add the Parser to the ParseContext to get recursive parsing. W/o use of ParseContext in call to parse, the parser works recursively. Will fix Monday unless someone beats me to it. Thank you for raising this. No need to attach test doc.

Key: TIKA-1584
URL: https://issues.apache.org/jira/browse/TIKA-1584
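On the client side of the Tika API, the fix described above amounts to registering the parser with the ParseContext before calling parse. A minimal sketch against the standard Tika API; the helper name and the use of AutoDetectParser here are illustrative, not the actual tika-server code:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

public class RecursiveExtract {

    // Hypothetical helper: extract text, recursing into embedded documents
    // (e.g. a docx inside a zip attached to an eml).
    static String extractRecursively(InputStream stream) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        ParseContext context = new ParseContext();
        // Without this line, container formats stop at the outermost
        // embedded entry; registering the parser enables recursion.
        context.set(Parser.class, parser);
        BodyContentHandler handler = new BodyContentHandler(-1); // no write limit
        parser.parse(stream, handler, new Metadata(), context);
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        InputStream in = new ByteArrayInputStream(
                "hello world".getBytes(StandardCharsets.UTF_8));
        System.out.println(extractRecursively(in));
    }
}
```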
[jira] [Commented] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)
[ https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385483#comment-14385483 ]

Tyler Palsulich commented on TIKA-1584:

We now have two major issues which need a quick release. So, I would say go for 1.8. Tim, can you chime in on the current discuss thread?

Key: TIKA-1584
URL: https://issues.apache.org/jira/browse/TIKA-1584
[jira] [Commented] (TIKA-1580) ISA-Tab parsers
[ https://issues.apache.org/jira/browse/TIKA-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385558#comment-14385558 ] Hudson commented on TIKA-1580: SUCCESS: Integrated in tika-trunk-jdk1.7 #579 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/579/]) Fix for TIKA-1580: Support IsaTab MIME identification and parsing. Thanks to Giuseppe Totaro for all the great work! (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1669839)
* /tika/trunk/CHANGES.txt
* /tika/trunk/tika-bundle/pom.xml
* /tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
* /tika/trunk/tika-parsers/pom.xml
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/isatab
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/isatab/ISATabUtils.java
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/isatab/ISArchiveParser.java
* /tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/isatab
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/isatab/ISArchiveParserTest.java
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_bii-s-2_metabolite profiling_NMR spectroscopy.txt
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_metabolome.txt
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_microarray.txt
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_proteome.txt
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_transcriptome.txt
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/i_investigation.txt
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/s_BII-S-1.txt
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/s_BII-S-2.txt

ISA-Tab parsers
Key: TIKA-1580 URL: https://issues.apache.org/jira/browse/TIKA-1580 Project: Tika Issue Type: New Feature Components: parser Reporter: Giuseppe Totaro Assignee: Chris A. Mattmann Priority: Minor Labels: new-parser Fix For: 1.8 Attachments: TIKA-1580.Mattmann.Totaro.032515.patch.txt, TIKA-1580.patch, TIKA-1580.v02.patch, TIKA-1580.v03.2.Mattmann.Totaro.03262015.patch

We are going to add parsers for ISA-Tab data formats. ISA-Tab files are related to [ISA Tools|http://www.isa-tools.org/], which help to manage an increasingly diverse set of life science, environmental, and biomedical experiments that employ one or a combination of technologies. The ISA tools are built upon the _Investigation_, _Study_, and _Assay_ tabular formats. Therefore, the ISA-Tab data format includes three types of file: the Investigation file ({{i_.txt}}), the Study file ({{s_.txt}}), and the Assay file ({{a_.txt}}). These files are organized as a [top-down hierarchy|http://www.isa-tools.org/format/specification/]: an Investigation file includes one or more Study files; each Study file includes one or more Assay files. Essentially, the Investigation file contains high-level information about the related study, so it provides only metadata about the ISA-Tab files. More details on the file format specification are [available online|http://isatab.sourceforge.net/docs/ISA-TAB_release-candidate-1_v1.0_24nov08.pdf]. The attached patch provides a preliminary version of the ISA-Tab parsers (there are three parsers, one for each ISA-Tab filetype):
* {{ISATabInvestigationParser.java}}: parses Investigation files. It extracts only metadata.
* {{ISATabStudyParser.java}}: parses Study files.
* {{ISATabAssayParser.java}}: parses Assay files.
The most important improvements are:
* Combine these three parsers in order to parse an ISArchive
* Provide a better mapping of both study and assay data onto XHTML.
Currently, {{ISATabStudyParser}} and {{ISATabAssayParser}} provide a naive mapping function relying on [Apache Commons CSV|https://commons.apache.org/proper/commons-csv/]. Thanks for supporting me on this work [~chrismattmann].
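For readers unfamiliar with the format, the naive tab-delimited mapping described above can be sketched in plain Java. This is a hypothetical stdlib-only helper (`IsaTabSketch` is a made-up name, and the column names are invented examples); the actual parsers rely on Apache Commons CSV configured with a tab delimiter.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the naive column mapping an ISA-Tab study/assay
// parser performs: first row is the header, every later row is a record keyed
// by that header. The real code uses Apache Commons CSV with a tab delimiter.
public class IsaTabSketch {

    static List<Map<String, String>> parse(String content) {
        String[] lines = content.split("\r?\n");
        // The header row names the columns of every subsequent record.
        String[] header = lines[0].split("\t", -1);
        List<Map<String, String>> records = new ArrayList<>();
        for (int i = 1; i < lines.length; i++) {
            String[] cells = lines[i].split("\t", -1);
            Map<String, String> row = new LinkedHashMap<>();
            for (int c = 0; c < header.length && c < cells.length; c++) {
                row.put(header[c], cells[c]);
            }
            records.add(row);
        }
        return records;
    }

    public static void main(String[] args) {
        // Invented study-style content, two records under two columns.
        String study = "Source Name\tSample Name\nculture1\tsample-E\nculture2\tsample-N";
        List<Map<String, String>> rows = parse(study);
        System.out.println(rows.size());                    // 2
        System.out.println(rows.get(0).get("Sample Name")); // sample-E
    }
}
```

Combining the three parsers into one ISArchive parser then amounts to walking the Investigation file and dispatching each referenced study/assay file through a mapping like this.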
[jira] [Commented] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)
[ https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385472#comment-14385472 ] Rob Tulloh commented on TIKA-1584: Thank you. For what it's worth, it is easy to reproduce. Just zip any document you want, pass the zip file to tika-server, and see what it gives back. As 1.7 is released, does this mean that this won't be fixed until 1.8, or would 1.7 get re-released/patched?
[jira] [Commented] (TIKA-1580) ISA-Tab parsers
[ https://issues.apache.org/jira/browse/TIKA-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385540#comment-14385540 ] Chris A. Mattmann commented on TIKA-1580: Built and unit tested successfully. Deployed in production on a bioinformatics project. Works great! Committing now.
{noformat}
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Apache Tika parent ................................. SUCCESS [  1.994 s]
[INFO] Apache Tika core ................................... SUCCESS [ 19.033 s]
[INFO] Apache Tika parsers ................................ SUCCESS [02:29 min]
[INFO] Apache Tika XMP .................................... SUCCESS [  2.675 s]
[INFO] Apache Tika serialization .......................... SUCCESS [  2.082 s]
[INFO] Apache Tika batch .................................. SUCCESS [01:57 min]
[INFO] Apache Tika application ............................ SUCCESS [ 13.227 s]
[INFO] Apache Tika OSGi bundle ............................ SUCCESS [ 18.103 s]
[INFO] Apache Tika server ................................. SUCCESS [ 22.457 s]
[INFO] Apache Tika translate .............................. SUCCESS [  3.347 s]
[INFO] Apache Tika examples ............................... SUCCESS [  6.103 s]
[INFO] Apache Tika Java-7 Components ...................... SUCCESS [  2.039 s]
[INFO] Apache Tika ........................................ SUCCESS [  0.030 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 05:58 min
[INFO] Finished at: 2015-03-28T14:45:20-07:00
[INFO] Final Memory: 103M/1592M
[INFO] ------------------------------------------------------------------------
[chipotle:~/tmp/tika] mattmann%
{noformat}
[jira] [Created] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)
Rob Tulloh created TIKA-1584: Summary: Tika 1.7 possible regression (nested attachment files not getting parsed) Key: TIKA-1584 URL: https://issues.apache.org/jira/browse/TIKA-1584 Project: Tika Issue Type: Bug Components: server Affects Versions: 1.7 Reporter: Rob Tulloh
[jira] [Created] (TIKA-1585) Create Example Website with Form Submission
Tyler Palsulich created TIKA-1585: Summary: Create Example Website with Form Submission Key: TIKA-1585 URL: https://issues.apache.org/jira/browse/TIKA-1585 Project: Tika Issue Type: New Feature Components: example, server Reporter: Tyler Palsulich Assignee: Tyler Palsulich It would be great to have a website where we can direct people who ask what Tika can do for [filetype], without needing them to actually download Tika. Some initial work toward that is [here|http://tpalsulich.github.io/TikaExamples/]. I'm far from a design guru, but I imagine the site as having a form where you can upload a file at the top, checkboxes for whether you want metadata, content, or both, and a submit button. The request should be sent with AJAX, and the result should populate a {{div}}. One issue with AJAX requests is that Tika Server doesn't currently allow Cross-Origin Resource Sharing (CORS). So, we either need to maintain a slightly updated tika-server, or update the server to allow configuration.
[jira] [Commented] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)
[ https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385328#comment-14385328 ] Rob Tulloh commented on TIKA-1584: If the .zip file is passed to tika, it shows the same behavior.
{noformat}
curl -X PUT -T sign.zip -H Content-Type:application/octet-stream http://localhost:9998/tika 2>/dev/null
sign.docx
{noformat}
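The behavior the reporter expects, descending into containers rather than only listing entry names, can be illustrated with a standard-library sketch. This is not Tika's embedded-document API; `ZipRecursionSketch` is a made-up name, and the point is only the recursion idea (a docx is itself a zip, so stopping at the outer archive loses its text).

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical sketch of container recursion: descend into each zip-like
// entry instead of stopping at the first level of the archive.
public class ZipRecursionSketch {

    // Collect entry names, recursing into entries that are themselves containers.
    static void collect(InputStream in, String name, List<String> found) throws IOException {
        if (!name.endsWith(".zip") && !name.endsWith(".docx")) {
            return; // a leaf entry: nothing further to unwind
        }
        ZipInputStream zis = new ZipInputStream(new BufferedInputStream(in));
        ZipEntry entry;
        while ((entry = zis.getNextEntry()) != null) {
            found.add(name + "/" + entry.getName());
            collect(zis, entry.getName(), found); // entry data may itself be a zip
        }
    }

    public static void main(String[] args) throws IOException {
        // Build inner.zip containing doc.txt, then outer.zip containing inner.zip.
        ByteArrayOutputStream inner = new ByteArrayOutputStream();
        try (ZipOutputStream z = new ZipOutputStream(inner)) {
            z.putNextEntry(new ZipEntry("doc.txt"));
            z.write("hello".getBytes("UTF-8"));
        }
        ByteArrayOutputStream outer = new ByteArrayOutputStream();
        try (ZipOutputStream z = new ZipOutputStream(outer)) {
            z.putNextEntry(new ZipEntry("inner.zip"));
            z.write(inner.toByteArray());
        }
        List<String> found = new ArrayList<>();
        collect(new ByteArrayInputStream(outer.toByteArray()), "outer.zip", found);
        System.out.println(found); // [outer.zip/inner.zip, inner.zip/doc.txt]
    }
}
```

A 1.5-style result corresponds to the recursive call being made; the 1.7 behavior reported here corresponds to recording only the first level.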
[jira] [Commented] (TIKA-1581) jhighlight license concerns
[ https://issues.apache.org/jira/browse/TIKA-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385331#comment-14385331 ] Ken Krugler commented on TIKA-1581: Hi Tyler, JHighlight has been updated in Central, and Tika is now using that version. So I believe it's resolved, as long as the changes that Hong-Thai made to the NOTICE.txt are sufficient for the CDDL license used by jhighlight. And yes, as per my comment above, we'll need to release a new version of Tika for downstream libraries. Seems like it could be worth a quick dot release for ManifoldCF/Lucene.

jhighlight license concerns
Key: TIKA-1581 URL: https://issues.apache.org/jira/browse/TIKA-1581 Project: Tika Issue Type: Bug Affects Versions: 1.7 Reporter: Karl Wright Fix For: 1.8

jhighlight jar is a Tika dependency. The Lucene team discovered that, while it claims to be a CDDL/LGPL dual-license, some of its functionality is LGPL only:
{code}
Solr's contrib/extraction contains jhighlight-1.0.jar which declares itself as dual CDDL or LGPL license. However, some of its classes are distributed only under LGPL, e.g. com.uwyn.jhighlight.highlighter.
  CppHighlighter.java
  GroovyHighlighter.java
  JavaHighlighter.java
  XmlHighlighter.java
I downloaded the sources from Maven (http://search.maven.org/remotecontent?filepath=com/uwyn/jhighlight/1.0/jhighlight-1.0-sources.jar) to confirm that, and also found this SVN repo: http://svn.rifers.org/jhighlight/tags/release-1.0, though the project's website seems to not exist anymore (https://jhighlight.dev.java.net/). I didn't find any direct usage of it in our code, so I guess it's probably needed by a 3rd party dependency, such as Tika. Therefore if we e.g. omit it, things will compile, but may fail at runtime.
{code}
Is it possible to remove this dependency for future releases, or allow only optional inclusion of this package? It is of concern to the ManifoldCF project because we distribute a binary package that includes Tika and its required dependencies, which currently includes jHighlight.
[jira] [Commented] (TIKA-1581) jhighlight license concerns
[ https://issues.apache.org/jira/browse/TIKA-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385346#comment-14385346 ] Ken Krugler commented on TIKA-1581: Based on what I see in other projects (e.g. the Lucene NOTICE.txt file), this seems to be following standard practices, so I'm going to assume it's OK.
[jira] [Resolved] (TIKA-1526) ExternalParser should trap/ignore/workarround JDK-8047340 JDK-8055301 so Turkish Tika users can still use non-external parsers
[ https://issues.apache.org/jira/browse/TIKA-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich resolved TIKA-1526. Resolution: Fixed Marking this as Fixed, per the above comments. [~thetaphi] or [~hossman] or anyone else, please reopen this if you find any other cases. Thank you everyone for the help!

ExternalParser should trap/ignore/workarround JDK-8047340 JDK-8055301 so Turkish Tika users can still use non-external parsers
Key: TIKA-1526 URL: https://issues.apache.org/jira/browse/TIKA-1526 Project: Tika Issue Type: Wish Reporter: Hoss Man

The JDK has numerous pain points regarding the Turkish locale, posix_spawn lowercasing being one of them... https://bugs.openjdk.java.net/browse/JDK-8047340 https://bugs.openjdk.java.net/browse/JDK-8055301 As of Tika 1.7, the TesseractOCRParser (which is an ExternalParser) is enabled and configured by default in Tika, and uses ExternalParser.check to see if tesseract is available -- but because of the JDK bug, this means that Tika fails fast for Turkish users on BSD/UNIX variants (including MacOSX), like so...
{noformat}
[junit4] Throwable #1: java.lang.Error: posix_spawn is not a supported process launch mechanism on this platform.
[junit4]   at java.lang.UNIXProcess$1.run(UNIXProcess.java:105)
[junit4]   at java.lang.UNIXProcess$1.run(UNIXProcess.java:94)
[junit4]   at java.security.AccessController.doPrivileged(Native Method)
[junit4]   at java.lang.UNIXProcess.<clinit>(UNIXProcess.java:92)
[junit4]   at java.lang.ProcessImpl.start(ProcessImpl.java:130)
[junit4]   at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
[junit4]   at java.lang.Runtime.exec(Runtime.java:620)
[junit4]   at java.lang.Runtime.exec(Runtime.java:485)
[junit4]   at org.apache.tika.parser.external.ExternalParser.check(ExternalParser.java:344)
[junit4]   at org.apache.tika.parser.ocr.TesseractOCRParser.hasTesseract(TesseractOCRParser.java:117)
[junit4]   at org.apache.tika.parser.ocr.TesseractOCRParser.getSupportedTypes(TesseractOCRParser.java:90)
[junit4]   at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
[junit4]   at org.apache.tika.parser.DefaultParser.getParsers(DefaultParser.java:95)
[junit4]   at org.apache.tika.parser.CompositeParser.getSupportedTypes(CompositeParser.java:229)
[junit4]   at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
[junit4]   at org.apache.tika.parser.CompositeParser.getParser(CompositeParser.java:209)
[junit4]   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
[junit4]   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
{noformat}
...unless they go out of their way to whitelist only the parsers they need/want, so TesseractOCRParser (and any other ExternalParsers) will never even be check()ed. It would be nice if Tika's ExternalParser class added a similar hack/workaround to what was done in SOLR-6387 to trap these types of errors. In Solr we just propagate a better error explaining why Java hates the Turkish language...
{code}
} catch (Error err) {
  if (err.getMessage() != null &&
      (err.getMessage().contains("posix_spawn") || err.getMessage().contains("UNIXProcess"))) {
    log.warn("Error forking command due to JVM locale bug (see https://issues.apache.org/jira/browse/SOLR-6387): " + err.getMessage());
    return "(error executing: " + cmd + ")";
  }
}
{code}
...but with Tika, it might be better for all ExternalParsers to just opt out as if they don't recognize the filetype when they detect this type of error from the check method (or perhaps it would be better if AutoDetectParser handled this? ... I'm not really sure how it would best fit into Tika's architecture)
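The underlying JDK pitfall is easy to reproduce with a few lines of standalone Java (this is a demo of the locale behavior the linked JDK bugs describe, not Tika code): under the Turkish locale, lowercasing "I" yields dotless "ı" (U+0131), which breaks any internal ASCII-only name mangling such as "POSIX_SPAWN" → "posix_spawn".

```java
import java.util.Locale;

// Standalone demo of the locale pitfall behind JDK-8047340: Turkish case
// mapping turns 'I' into dotless 'ı' (U+0131) rather than 'i'.
public class TurkishLocaleDemo {
    public static void main(String[] args) {
        String s = "POSIX_SPAWN";
        // Locale-insensitive lowercasing gives the expected ASCII result.
        System.out.println(s.toLowerCase(Locale.ROOT));            // posix_spawn
        // Turkish lowercasing produces a different string entirely.
        System.out.println(s.toLowerCase(new Locale("tr", "TR"))); // posıx_spawn
    }
}
```

This is why code that must compare or mangle identifiers should use {{Locale.ROOT}} (or {{toLowerCase(Locale.ROOT)}}) rather than the default locale.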
[jira] [Commented] (TIKA-1581) jhighlight license concerns
[ https://issues.apache.org/jira/browse/TIKA-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385337#comment-14385337 ] Tyler Palsulich commented on TIKA-1581: Hi [~kkrugler]. Thanks. The comment is now bq. Tika-parsers component uses CDDL/LGPL dual-licensed dependency: jhighlight (https://github.com/codelibs/jhighlight) If this looks good, I'll start a \[DISCUSS\] thread on the list about a new version.
[jira] [Created] (TIKA-1586) Enable CORS on Tika Server
Tyler Palsulich created TIKA-1586: Summary: Enable CORS on Tika Server Key: TIKA-1586 URL: https://issues.apache.org/jira/browse/TIKA-1586 Project: Tika Issue Type: New Feature Components: server Reporter: Tyler Palsulich Assignee: Tyler Palsulich Tika Server should allow configuration of CORS requests (for uses like TIKA-1585). See [this example|http://cxf.apache.org/docs/jax-rs-cors.html] from CXF for how to add it. The only change from that site is that we will need to add a {{CrossOriginResourceSharingFilter}} as a provider. Ideally, this is configurable (limit which resources have CORS, and which origins are allowed). But I'm not thinking of any general methods of how to do that...
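What a CORS filter contributes at the HTTP level can be sketched with only the JDK's built-in HTTP server. This is an illustration of the header mechanics, not tika-server code (the class name `CorsSketch` and the `/tika` context here are made up); the real change registers CXF's {{CrossOriginResourceSharingFilter}} as a JAX-RS provider instead of setting headers by hand.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;

// Stdlib-only sketch: the Access-Control-Allow-Origin response header is what
// lets a browser page served from another origin read the response body.
public class CorsSketch {

    static HttpServer start() throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0); // port 0 = pick a free port
        server.createContext("/tika", exchange -> {
            // The one header a permissive CORS filter adds to every response.
            exchange.getResponseHeaders().add("Access-Control-Allow-Origin", "*");
            byte[] body = "ok".getBytes("UTF-8");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws Exception {
        HttpServer server = start();
        int port = server.getAddress().getPort();
        HttpURLConnection conn =
                (HttpURLConnection) new URL("http://localhost:" + port + "/tika").openConnection();
        System.out.println(conn.getHeaderField("Access-Control-Allow-Origin")); // *
        server.stop(0);
    }
}
```

Making the allowed origins and resources configurable, as the issue asks, would mean parameterizing the "*" value rather than hard-coding it.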
[GitHub] tika pull request: TIKA-1586. Enable CORS requests on Tika server
Github user asfgit closed the pull request at: https://github.com/apache/tika/pull/37 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (TIKA-1586) Enable CORS on Tika Server
[ https://issues.apache.org/jira/browse/TIKA-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385398#comment-14385398 ] ASF GitHub Bot commented on TIKA-1586: Github user asfgit closed the pull request at: https://github.com/apache/tika/pull/37
[jira] [Resolved] (TIKA-1586) Enable CORS on Tika Server
[ https://issues.apache.org/jira/browse/TIKA-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich resolved TIKA-1586. Resolution: Fixed Fixed in r1669799.
[jira] [Commented] (TIKA-1354) ForkParser doesn't work in OSGI container
[ https://issues.apache.org/jira/browse/TIKA-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385403#comment-14385403 ] Chris A. Mattmann commented on TIKA-1354: thanks! ForkParser doesn't work in OSGI container Key: TIKA-1354 URL: https://issues.apache.org/jira/browse/TIKA-1354 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.6 Reporter: Michal Hlavac Fix For: 1.7 I can't find a way to run ForkParser in an OSGI container.
[jira] [Commented] (TIKA-1577) NetCDF Data Extraction
[ https://issues.apache.org/jira/browse/TIKA-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385408#comment-14385408 ] Chris A. Mattmann commented on TIKA-1577: Agreed, if we can reuse this, then great. The one catch is that I'm not sure that dump capability generates a table or something in an XHTML representation, which is our basis representation in Tika. I would like us to consider the output of this issue to be:
- TikaParser generates XHTML tabular and other elements that represent the data in the NetCDF file
- we create something like a ScientificContentHandler that can then take that output from the parser (in the data section) and then format it, e.g., like NCDump.
Sound good?

NetCDF Data Extraction
Key: TIKA-1577 URL: https://issues.apache.org/jira/browse/TIKA-1577 Project: Tika Issue Type: Improvement Components: handler, parser Affects Versions: 1.7 Reporter: Ann Burgess Assignee: Ann Burgess Labels: features, handler Fix For: 1.8 Original Estimate: 504h Remaining Estimate: 504h

A netCDF classic or 64-bit offset dataset is stored as a single file comprising two parts:
- a header, containing all the information about dimensions, attributes, and variables except for the variable data;
- a data part, comprising fixed-size data, containing the data for variables that don't have an unlimited dimension, and variable-size data, containing the data for variables that have an unlimited dimension.
The NetCDFParser currently extracts the header part:
- text extracts file Dimensions and Variables
- metadata extracts Global Attributes
We want the option to extract the data part of NetCDF files. Let's use the NetCDF test file for our dev testing: tika/tika-parsers/src/test/resources/test-documents/sresa1b_ncar_ccsm3_0_run1_21.nc
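The split proposed in the comment above, where the parser emits the data section as XHTML tabular elements that a downstream handler can re-render (e.g. NCDump-style), can be sketched with the JDK's StAX writer. Everything here is hypothetical: `NetcdfXhtmlSketch` is a made-up name, the variable and values are invented, and the real work would live in the NetCDF parser and a content handler.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical sketch: render one variable's data values as XHTML table rows,
// the kind of intermediate representation a ScientificContentHandler could
// later re-format (e.g. into ncdump-style text).
public class NetcdfXhtmlSketch {

    static String toXhtmlTable(String variable, double[] values) throws Exception {
        StringWriter out = new StringWriter();
        XMLStreamWriter w = XMLOutputFactory.newFactory().createXMLStreamWriter(out);
        w.writeStartElement("table");
        for (double v : values) {
            w.writeStartElement("tr");
            w.writeStartElement("td");
            w.writeCharacters(variable);          // variable name column
            w.writeEndElement();
            w.writeStartElement("td");
            w.writeCharacters(Double.toString(v)); // data value column
            w.writeEndElement();
            w.writeEndElement();
        }
        w.writeEndElement();
        w.flush();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Invented sample values; a real parser would read them from the data part.
        System.out.println(toXhtmlTable("tas", new double[]{215.8, 215.7}));
    }
}
```

Using a streaming writer keeps the parser side simple, and a SAX/StAX handler on the other end can consume the same element stream without buffering the whole (potentially large) data part.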
[jira] [Commented] (TIKA-1586) Enable CORS on Tika Server
[ https://issues.apache.org/jira/browse/TIKA-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385407#comment-14385407 ] Hudson commented on TIKA-1586: SUCCESS: Integrated in tika-trunk-jdk1.7 #578 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/578/]) TIKA-1586. Enable CORS requests on Tika server. This fixes #37. (tpalsulich: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1669799)
* /tika/trunk/CHANGES.txt
* /tika/trunk/tika-server/pom.xml
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaServerCli.java
Re: trunk test failure
Thanks Oleg! ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: Oleg Tikhonov olegtikho...@gmail.com Reply-To: dev@tika.apache.org dev@tika.apache.org Date: Thursday, March 26, 2015 at 12:19 AM To: dev@tika.apache.org dev@tika.apache.org Subject: Re: trunk test failure Hi Chris, just to confirm:
[INFO] Reactor Summary:
[INFO] Apache Tika parent ............................... SUCCESS [  9.268 s]
[INFO] Apache Tika core ................................. SUCCESS [ 25.823 s]
[INFO] Apache Tika parsers .............................. SUCCESS [02:41 min]
[INFO] Apache Tika XMP .................................. SUCCESS [  1.986 s]
[INFO] Apache Tika serialization ........................ SUCCESS [  1.604 s]
[INFO] Apache Tika batch ................................ SUCCESS [02:02 min]
[INFO] Apache Tika application .......................... SUCCESS [ 18.983 s]
[INFO] Apache Tika OSGi bundle .......................... SUCCESS [ 29.087 s]
[INFO] Apache Tika server ............................... SUCCESS [ 46.706 s]
[INFO] Apache Tika translate ............................ SUCCESS [  9.163 s]
[INFO] Apache Tika examples ............................. SUCCESS [  4.134 s]
[INFO] Apache Tika Java-7 Components .................... SUCCESS [  1.236 s]
[INFO] Apache Tika ...................................... SUCCESS [  0.017 s]
[INFO] BUILD SUCCESS
[INFO] Total time: 07:20 min
[INFO] Finished at: 2015-03-26T09:18:46+02:00
[INFO] Final Memory: 91M/848M
BR, OLeg On Thu, Mar 26, 2015 at 1:21 AM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: OK I am nuts - I was applying the patch from TIKA-1580, but didn't update Felix in the bundle pom - done now, building again. Yay. ++ Chris Mattmann, Ph.D. 
Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: Mattmann, Chris Mattmann chris.a.mattm...@jpl.nasa.gov Reply-To: dev@tika.apache.org dev@tika.apache.org Date: Wednesday, March 25, 2015 at 6:57 PM To: dev@tika.apache.org dev@tika.apache.org Subject: trunk test failure Hey Anyone else seeing this failure in trunk?
Running org.apache.tika.bundle.BundleIT
[main] INFO org.ops4j.pax.exam.spi.DefaultExamSystem - Pax Exam System (Version: 4.4.0) created.
[main] INFO org.ops4j.pax.exam.junit.impl.ProbeRunner - creating PaxExam runner for class org.apache.tika.bundle.BundleIT
[main] INFO org.ops4j.pax.exam.junit.impl.ProbeRunner - running test class org.apache.tika.bundle.BundleIT
ERROR: Bundle org.apache.tika.bundle [17] Error starting file:/Users/mattmann/tmp/tika/tika-bundle/target/test-bundles/tika-bundle.jar (org.osgi.framework.BundleException: Unresolved constraint in bundle org.apache.tika.bundle [17]: Unable to resolve 17.0: missing requirement [17.0] osgi.wiring.package; (&(osgi.wiring.package=org.apache.commons.csv)(version>=1.0.0)(!(version>=2.0.0))))
org.osgi.framework.BundleException: Unresolved constraint in bundle org.apache.tika.bundle [17]: Unable to resolve 17.0: missing requirement [17.0] osgi.wiring.package; (&(osgi.wiring.package=org.apache.commons.csv)(version>=1.0.0)(!(version>=2.0.0)))
at org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:4097)
at org.apache.felix.framework.Felix.startBundle(Felix.java:2114)
at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1368)
at org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:308)
at java.lang.Thread.run(Thread.java:745)
[main] ERROR
[jira] [Commented] (TIKA-1585) Create Example Website with Form Submission
[ https://issues.apache.org/jira/browse/TIKA-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385411#comment-14385411 ] Tyler Palsulich commented on TIKA-1585: --- CORS work is now integrated. [~talli...@mitre.org], can you restart the server on 162.242.228.174:9998 with the --cors "http://tpalsulich.github.io" option? Then, we can close off the 9997 port (my github.io site is querying 9997, though, so I'll need to update that). Is there an official place we'd like to host the above site? Create Example Website with Form Submission --- Key: TIKA-1585 URL: https://issues.apache.org/jira/browse/TIKA-1585 Project: Tika Issue Type: New Feature Components: example, server Reporter: Tyler Palsulich Assignee: Tyler Palsulich It would be great to have a website where we can direct people who ask what Tika can do for [filetype] without needing them to actually download Tika. Some initial work to do that is [here|http://tpalsulich.github.io/TikaExamples/]. I'm far from a design guru, but I imagine the site as having a form where you can upload a file at the top, checkboxes for whether you want metadata, content, or both, and a submit button. The request should be sent with AJAX, and the result should populate a {{div}}. One issue with AJAX requests is that Tika Server doesn't currently allow Cross-Origin Resource Sharing (CORS). So, we either need to maintain a slightly updated tika-server, or update the server to allow configuration. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
RE: [DISCUSS] Tika 1.8 or 1.7.1
Given how recently we did a 1.7 release, my vote would be for 1.7.1. And to keep this release as simple as possible, just cherry-pick the fix for TIKA-1581 into the 1.7 code base. -- Ken From: Tyler Palsulich Sent: March 28, 2015 8:01:03am PDT To: dev@tika.apache.org Subject: [DISCUSS] Tika 1.8 or 1.7.1 Hi Folks, Now that TIKA-1581 (JHighlight licensing issues) is resolved, we need to release a new version of Tika. I'll volunteer to be the release manager again. Should we release this as 1.8 or 1.7.1? Does anyone have any last minute issues they'd like to finish and see in Tika 1.X? I'd like to get the example working with CORS (TIKA-1585 and TIKA-1586). Any others? Have a good weekend, Tyler -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr
[jira] [Commented] (TIKA-1579) Add file type to NetCDFParser
[ https://issues.apache.org/jira/browse/TIKA-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385383#comment-14385383 ] Ann Burgess commented on TIKA-1579: --- Yes! On Sat, Mar 28, 2015 at 6:09 AM, Tyler Palsulich (JIRA) j...@apache.org -- -- Ann Bryant Burgess, PhD Postdoctoral Fellow Computer Science Department Viterbi School of Engineering University of Southern California Phone: (585) 738-7549 -- Add file type to NetCDFParser - Key: TIKA-1579 URL: https://issues.apache.org/jira/browse/TIKA-1579 Project: Tika Issue Type: Improvement Components: parser Reporter: Ann Burgess Assignee: Ann Burgess Attachments: TIKA-1579.abburgess.190315.patch.txt [~gostep] explains that there are three versions of NetCDF (classic format, 64-bit offset, and netCDF-4/HDF5 format). When opening an existing netCDF file, the netCDF library transparently detects its format, so we do not need to adjust according to the detected format. That said, it would be good to know the file type, as each version can have the .nc extension. This patch will add the file type to the metadata. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
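The three on-disk formats mentioned above can be told apart by their leading magic bytes: classic files begin with "CDF" followed by 0x01, 64-bit offset files with "CDF" followed by 0x02, and netCDF-4 files carry the HDF5 signature. A minimal, self-contained sketch of such a check (the class name and returned labels are made up for illustration, not Tika's API):

```java
import java.util.Arrays;

public class NetcdfFormatSniffer {
    // Distinguish the three NetCDF on-disk formats by magic number:
    //   "CDF" 0x01              -> classic
    //   "CDF" 0x02              -> 64-bit offset
    //   0x89 "HDF" \r \n 0x1a \n -> netCDF-4 (HDF5 container)
    static String detect(byte[] header) {
        byte[] hdf5 = {(byte) 0x89, 'H', 'D', 'F', '\r', '\n', 0x1a, '\n'};
        if (header.length >= 8 && Arrays.equals(Arrays.copyOf(header, 8), hdf5)) {
            return "netCDF-4/HDF5";
        }
        if (header.length >= 4
                && header[0] == 'C' && header[1] == 'D' && header[2] == 'F') {
            if (header[3] == 1) return "classic";
            if (header[3] == 2) return "64-bit offset";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        System.out.println(detect(new byte[]{'C', 'D', 'F', 1}));
    }
}
```

Since all three formats share the .nc extension, a magic-number check like this is the only reliable way to record the exact format in the metadata.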
[DISCUSS] Tika 1.8 or 1.7.1
Hi Folks, Now that TIKA-1581 (JHighlight licensing issues) is resolved, we need to release a new version of Tika. I'll volunteer to be the release manager again. Should we release this as 1.8 or 1.7.1? Does anyone have any last minute issues they'd like to finish and see in Tika 1.X? I'd like to get the example working with CORS (TIKA-1585 and TIKA-1586). Any others? Have a good weekend, Tyler
[GitHub] tika pull request: TIKA-1586. Enable CORS requests on Tika server
GitHub user tpalsulich opened a pull request: https://github.com/apache/tika/pull/37 TIKA-1586. Enable CORS requests on Tika server You can merge this pull request into a Git repository by running: $ git pull https://github.com/tpalsulich/tika TIKA-1586 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/37.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #37 commit c296a0459477f1ba088d96fa6ba3895e3a6b3ac5 Author: Tyler Palsulich tpalsul...@gmail.com Date: 2015-03-28T15:45:45Z TIKA-1586. Enable CORS requests on Tika server. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
Re: [DISCUSS] Tika 1.8 or 1.7.1
Hi Tyler - I would VOTE for 1.8. Given the stuff associated with releasing (updating the website; sending emails; waiting periods, etc.) let’s ship all the updates we have too along with the jhighlight fix. Cheers, Chris ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: Tyler Palsulich tpalsul...@apache.org Reply-To: dev@tika.apache.org dev@tika.apache.org Date: Saturday, March 28, 2015 at 8:01 AM To: dev@tika.apache.org dev@tika.apache.org Subject: [DISCUSS] Tika 1.8 or 1.7.1 Hi Folks, Now that TIKA-1581 (JHighlight licensing issues) is resolved, we need to release a new version of Tika. I'll volunteer to be the release manager again. Should we release this as 1.8 or 1.7.1? Does anyone have any last minute issues they'd like to finish and see in Tika 1.X? I'd like to get the example working with CORS (TIKA-1585 and TIKA-1586). Any others? Have a good weekend, Tyler
[jira] [Commented] (TIKA-1586) Enable CORS on Tika Server
[ https://issues.apache.org/jira/browse/TIKA-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385369#comment-14385369 ] ASF GitHub Bot commented on TIKA-1586: -- GitHub user tpalsulich opened a pull request: https://github.com/apache/tika/pull/37 TIKA-1586. Enable CORS requests on Tika server You can merge this pull request into a Git repository by running: $ git pull https://github.com/tpalsulich/tika TIKA-1586 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/37.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #37 commit c296a0459477f1ba088d96fa6ba3895e3a6b3ac5 Author: Tyler Palsulich tpalsul...@gmail.com Date: 2015-03-28T15:45:45Z TIKA-1586. Enable CORS requests on Tika server. Enable CORS on Tika Server -- Key: TIKA-1586 URL: https://issues.apache.org/jira/browse/TIKA-1586 Project: Tika Issue Type: New Feature Components: server Reporter: Tyler Palsulich Assignee: Tyler Palsulich -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1586) Enable CORS on Tika Server
[ https://issues.apache.org/jira/browse/TIKA-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385372#comment-14385372 ] Tyler Palsulich commented on TIKA-1586: --- Can someone take a look at the above PR and make sure I'm not doing anything bone-headed? Thanks! Enable CORS on Tika Server -- Key: TIKA-1586 URL: https://issues.apache.org/jira/browse/TIKA-1586 Project: Tika Issue Type: New Feature Components: server Reporter: Tyler Palsulich Assignee: Tyler Palsulich -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)
[ https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385440#comment-14385440 ] Tim Allison commented on TIKA-1584: --- Able to attach example triggering doc? By same behavior do you mean that a docx inside a zip is not extracted with -X? Tika 1.7 possible regression (nested attachment files not getting parsed) - Key: TIKA-1584 URL: https://issues.apache.org/jira/browse/TIKA-1584 Project: Tika Issue Type: Bug Components: server Affects Versions: 1.7 Reporter: Rob Tulloh -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)
[ https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385440#comment-14385440 ] Tim Allison edited comment on TIKA-1584 at 3/28/15 6:08 PM: Able to attach example triggering doc? By same behavior do you mean that a docx inside a zip is not extracted with /tika? Any luck w /rmeta? was (Author: talli...@mitre.org): Able to attach example triggering doc? By same behavior do you mean that a docx inside a zip is not extracted with -X? Tika 1.7 possible regression (nested attachment files not getting parsed) - Key: TIKA-1584 URL: https://issues.apache.org/jira/browse/TIKA-1584 Project: Tika Issue Type: Bug Components: server Affects Versions: 1.7 Reporter: Rob Tulloh -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1582) Mime Detection based on neural networks with Byte-frequency-histogram
[ https://issues.apache.org/jira/browse/TIKA-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke sh updated TIKA-1582: -- Attachment: nnmodel.docx Documentation Mime Detection based on neural networks with Byte-frequency-histogram -- Key: TIKA-1582 URL: https://issues.apache.org/jira/browse/TIKA-1582 Project: Tika Issue Type: Improvement Components: detector, mime Affects Versions: 1.7 Reporter: Luke sh Priority: Trivial Attachments: nnmodel.docx Content-based MIME type detection is one of the popular approaches to detecting MIME types; others are based on file extensions and magic numbers. Tika currently implements three detection approaches: 1) file extensions, 2) magic numbers (the most trustworthy in Tika), and 3) the Content-Type header of the HTTP response, if present. Content-based detection, by contrast, analyses the distribution of the entire stream of bytes, finds a similar pattern for files of the same type, and builds a function that groups them into one or more classes for classification and prediction. This feature could broaden the usage of Tika and add a measure of security to MIME type detection: because the model is etched with the patterns it has seen, we can choose not to trust types that it has not been trained on. Magic numbers embedded in a file can be copied, while the actual content could be a harmful Trojan program; by trusting byte-frequency patterns instead, we can harden detection. The proposed content-based detection to be integrated into Tika is based on a machine learning algorithm: a neural network trained with back-propagation. 
The input is 256 bins, one per byte value (0-255), each storing the count of occurrences of that byte. The byte-frequency histograms are normalized to fall in the range between 0 and 1, then passed through a companding function to enhance the infrequent bytes. The output of the neural network is a binary decision, 1 or 0. The proposed feature will be implemented with the GRB file type as one example: we build a model that classifies GRB files against non-GRB files. Note that the set of non-GRB files is huge and cannot be easily defined, so there need to be as many negative training examples as possible to form the non-GRB decision boundary. Neural networks involve two stages of processing: training and classification. Training can be done in any programming language; in this feature/research, the training of the neural network is implemented in R, and the source can be found in my GitHub repository, https://github.com/LukeLiush/filetypeDetection; I am also going to post a document that describes the use of the program and the syntax/format of the input and output. After training, we export the model and import it into Tika; in Tika, we create a TrainedModelDetector that reads this model file (with one or more model parameters) or several model files, so it can detect the MIME types covered by those models. Details of the research and usage of this proposed feature will be posted on my GitHub shortly. It is worth noting again that in this research we only worked out one model, GRB, as one example to demonstrate the use of this content-based MIME detection. 
One of the challenges, again, is that the non-GRB file types cannot be clearly defined unless we feed our model example data for all existing file types in the world, which seems too utopian; so it is better that the set of classes/types is given and defined in advance to minimize the problem domain. Another challenge is the size of the training data: even if we know the types we want to classify, getting enough training data to form a model is one of the main factors of success. In our example model, GRB data were collected from ftp://hydro1.sci.gsfc.nasa.gov/data/, and we found that the GRB data from that source all exhibit a similar pattern; a simple neural network structure is able to predict well, and even a linear logistic regression does a good job. However, if we pass GRB files collected from other sources to the model for prediction, the model predicts poorly and unexpectedly, which brings up the question of whether we need to include all training data or only the data of interest; including all data is very expensive, so it is necessary to introduce some domain knowledge to
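The feature-extraction step described in the ticket (a 256-bin byte histogram, normalized into [0, 1], then passed through a companding function to boost infrequent bytes) can be sketched as follows. The square-root companding exponent and the max-count normalization are assumptions for illustration; the ticket does not specify the exact functions used in the R training code:

```java
public class ByteHistogram {
    // Build a 256-bin byte-frequency histogram, normalize each bin into
    // [0, 1] (here by the most frequent byte's count, one plausible reading
    // of the ticket), then apply an assumed x^0.5 companding function so
    // that infrequent bytes contribute more to the feature vector.
    static double[] features(byte[] data) {
        double[] bins = new double[256];
        for (byte b : data) {
            bins[b & 0xFF]++;   // & 0xFF maps signed bytes to 0..255
        }
        double max = 0;
        for (double c : bins) {
            max = Math.max(max, c);
        }
        if (max == 0) {
            return bins;        // empty input: all-zero feature vector
        }
        for (int i = 0; i < 256; i++) {
            bins[i] = Math.pow(bins[i] / max, 0.5);
        }
        return bins;
    }

    public static void main(String[] args) {
        double[] f = features(new byte[]{0, 0, 1});
        System.out.println(f[0] + " " + f[1]);
    }
}
```

The resulting 256-element vector in [0, 1] is the kind of input a small feed-forward network (or logistic regression, as the ticket notes) can consume directly.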
[jira] [Updated] (TIKA-1582) Mime Detection based on neural networks with Byte-frequency-histogram
[ https://issues.apache.org/jira/browse/TIKA-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke sh updated TIKA-1582: -- Attachment: week2-report-histogram comparison.docx histogram comparison Mime Detection based on neural networks with Byte-frequency-histogram -- Key: TIKA-1582 URL: https://issues.apache.org/jira/browse/TIKA-1582 Project: Tika Issue Type: Improvement Components: detector, mime Affects Versions: 1.7 Reporter: Luke sh Priority: Trivial Attachments: nnmodel.docx, week2-report-histogram comparison.docx, week6 report.docx
[jira] [Updated] (TIKA-1582) Mime Detection based on neural networks with Byte-frequency-histogram
[ https://issues.apache.org/jira/browse/TIKA-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke sh updated TIKA-1582: -- Attachment: week6 report.docx Test report Mime Detection based on neural networks with Byte-frequency-histogram -- Key: TIKA-1582 URL: https://issues.apache.org/jira/browse/TIKA-1582 Project: Tika Issue Type: Improvement Components: detector, mime Affects Versions: 1.7 Reporter: Luke sh Priority: Trivial Attachments: nnmodel.docx, week6 report.docx
[GitHub] tika pull request: Nn branch
GitHub user LukeLiush opened a pull request: https://github.com/apache/tika/pull/36 Nn branch https://issues.apache.org/jira/browse/TIKA-1582 You can merge this pull request into a Git repository by running: $ git pull https://github.com/LukeLiush/tika nnBranch Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/36.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #36 commit eb04f13260bfb5e4f4b0bf7fd54ecd085995cb92 Author: LukeLiush hanson311...@gmail.com Date: 2015-03-28T07:12:06Z https://issues.apache.org/jira/browse/TIKA-1582 commit acaf27bb666fdef05bdb18d7edcaafe7ccfd9bf5 Author: LukeLiush hanson311...@gmail.com Date: 2015-03-28T07:16:07Z move the comments of apache licence to the top commit 701fcc394ed2110e4c771fbb84999dca77932392 Author: LukeLiush hanson311...@gmail.com Date: 2015-03-28T07:19:43Z add some comments commit 12f290826a88cd99bbf2e1a0385b315e73e3 Author: LukeLiush hanson311...@gmail.com Date: 2015-03-28T07:25:55Z move the example model file to the test resource directory commit 6c8d2e523c427380438f24d90985e28bfdbce050 Author: LukeLiush hanson311...@gmail.com Date: 2015-03-28T07:28:25Z remove empty comment block
[jira] [Commented] (TIKA-1582) Mime Detection based on neural networks with Byte-frequency-histogram
[ https://issues.apache.org/jira/browse/TIKA-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385177#comment-14385177 ]

ASF GitHub Bot commented on TIKA-1582:
--
GitHub user LukeLiush opened a pull request: https://github.com/apache/tika/pull/36 (Nn branch, https://issues.apache.org/jira/browse/TIKA-1582); the merge instructions and commit list are the same as in the pull-request message above.

Mime Detection based on neural networks with Byte-frequency-histogram
--
Key: TIKA-1582
URL: https://issues.apache.org/jira/browse/TIKA-1582
Project: Tika
Issue Type: Improvement
Components: detector, mime
Affects Versions: 1.7
Reporter: Luke sh
Priority: Trivial

Content-based detection is one of the popular approaches to MIME type detection; others are based on file extensions and magic numbers. Tika currently implements three approaches to detecting MIME types: 1) file extensions, 2) magic numbers (the most trustworthy in Tika), and 3) the Content-Type header of the HTTP response, if present and available. Content-based detection, by contrast, analyses the distribution of the entire stream of bytes, finds a similar pattern for files of the same type, and builds a function that groups them into one or several classes so as to classify and predict. It is believed this feature might broaden the usage of Tika and add a bit more security enforcement to MIME type detection: because we build a model that is etched with the patterns it has seen, in some situations we need not trust types that the model has not been trained on. Magic numbers embedded in a file can be copied, while the actual content could be a potentially detrimental Trojan program; by enforcing trust in byte-frequency patterns instead, we are able to enhance the security of the detection.

The proposed content-based MIME detection to be integrated into Tika is based on a machine-learning algorithm, namely a neural network trained with back-propagation. The input is 256 bins, one for each byte value 0-255, each storing the count of occurrences of that byte. The byte-frequency histograms are normalized to fall in the range between 0 and 1 and then passed through a companding function to enhance the infrequent bytes. The output of the neural network is a binary decision, 1 or 0. Note that the proposed feature will be implemented with the GRB file type as one example: we build a model that is able to distinguish the GRB file type from non-GRB file types. Since the set of non-GRB files is huge and cannot be easily defined, there need to be as many negative training examples as possible to form the non-GRB decision boundary. The neural network involves two stages of processing: training and classification.
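The input encoding described above (256-bin byte histogram, normalized into [0, 1], then companded to lift infrequent bytes) can be sketched as follows. This is an illustration, not the actual Tika code: the peak-normalization and the square-root companding curve are assumptions, since the issue does not fix the exact functions.

```python
from collections import Counter

def byte_histogram_features(data: bytes, exponent: float = 0.5):
    # One bin per byte value 0-255, holding that byte's occurrence count.
    counts = Counter(data)
    hist = [counts.get(b, 0) for b in range(256)]
    # Normalize the histogram into [0, 1]; here we divide by the peak count
    # (the issue only says the values are scaled into that range).
    peak = max(hist) or 1
    normalized = [c / peak for c in hist]
    # Companding: a concave curve such as x**0.5 boosts infrequent bytes
    # relative to frequent ones before the values reach the network.
    return [x ** exponent for x in normalized]
```

The resulting 256-element vector is what would be fed to the network's input layer, one value per bin.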
The training can be done in any programming language; in this feature/research, the training of the neural network is implemented in R, and the source can be found in my GitHub repository, https://github.com/LukeLiush/filetypeDetection. I am also going to post a document that describes the use of the program and the syntax/format of its input and output. After training, we need to export the model and import it into Tika: in Tika, we create a TrainedModelDetector that reads one or more model files, each with its model parameters, so that it can detect the MIME types covered by those models. Details of the research and usage of this proposed feature will be posted on my GitHub shortly.
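As a sketch of the classification stage a detector would run after importing the exported parameters: assuming the model reduces to a single logistic unit over the 256 companded bins (the function name, the all-zero weights in the usage example, and the 0.5 threshold are hypothetical illustrations, not the actual TrainedModelDetector implementation).

```python
import math

def classify(features, weights, bias, threshold=0.5):
    # Weighted sum of the 256 histogram features plus a bias term,
    # using weights loaded from the exported model file.
    z = bias + sum(w * x for w, x in zip(weights, features))
    # Logistic activation squashes the score into (0, 1).
    prob = 1.0 / (1.0 + math.exp(-z))
    # Binary decision: 1 means "matches the trained type (e.g. GRB)".
    return 1 if prob >= threshold else 0
```

With several trained types, the detector would hold one such parameter set per type and report the type whose unit fires (or none, falling back to Tika's other detectors).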