[jira] [Assigned] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)
[ https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison reassigned TIKA-1584:
- Assignee: Tim Allison

Tika 1.7 possible regression (nested attachment files not getting parsed)

Key: TIKA-1584
URL: https://issues.apache.org/jira/browse/TIKA-1584
Project: Tika
Issue Type: Bug
Components: server
Affects Versions: 1.7
Reporter: Rob Tulloh
Assignee: Tim Allison
Priority: Blocker

I tried to send this to the Tika user list, but got a qmail failure, so I am opening a JIRA to see if I can get help with this. There appears to be a change in Tika's behavior since 1.5 (the last version we have used). In 1.5, if we pass a file with content type rfc822 which contains a zip that contains a docx file, the entire content gets recursed and the text returned. In 1.7, Tika only unwinds as far as the zip file and ignores the content of the contained docx file. This is causing a regression failure in our search tests because the contents of the docx file are not found when searched for.

We are testing with tika-server, if this helps. If we ask the meta service to just characterize the test data, it correctly determines the input is of type rfc822. However, on extract, the contents of the attachment are not extracted as expected.

curl -X PUT -T test.eml -q -H Content-Type:application/octet-stream http://localhost:9998/meta 2>/dev/null | grep Content-Type
Content-Type,message/rfc822

curl -X PUT -T test.eml -q -H Content-Type:application/octet-stream http://localhost:9998/tika 2>/dev/null | grep docx
sign.docx   <--- this is not expected, need contents of this extracted

We can easily reproduce this problem with a simple eml file with an attachment. Can someone please comment on whether this seems like a problem, or whether perhaps we need to change something in our call to get the old behavior?

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Review Request 32291: ISATab parsers (preliminary version)
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/32291/#review78159 ---

Ship it!

I have run this in production and it works awesome! Thanks Giuseppe!

- Chris Mattmann

On March 23, 2015, 5:04 p.m., Giuseppe Totaro wrote:

--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/32291/ ---

(Updated March 23, 2015, 5:04 p.m.)

Review request for tika and Chris Mattmann.

Bugs: TIKA-1580
https://issues.apache.org/jira/browse/TIKA-1580

Repository: tika

Description
---
ISATab parsers. This preliminary solution provides three parsers, one for each ISA-Tab filetype (Investigation, Study, Assay).

Diffs
-
trunk/tika-bundle/pom.xml 1668683
trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml 1668683
trunk/tika-parsers/pom.xml 1668683
trunk/tika-parsers/src/main/java/org/apache/tika/parser/isatab/ISATabAssayParser.java PRE-CREATION
trunk/tika-parsers/src/main/java/org/apache/tika/parser/isatab/ISATabInvestigationParser.java PRE-CREATION
trunk/tika-parsers/src/main/java/org/apache/tika/parser/isatab/ISATabStudyParser.java PRE-CREATION
trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser 1668683
trunk/tika-parsers/src/test/java/org/apache/tika/parser/isatab/ISATabAssayParserTest.java PRE-CREATION
trunk/tika-parsers/src/test/java/org/apache/tika/parser/isatab/ISATabInvestigationParserTest.java PRE-CREATION
trunk/tika-parsers/src/test/java/org/apache/tika/parser/isatab/ISATabStudyParserTest.java PRE-CREATION
trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_bii-s-2_metabolite profiling_NMR spectroscopy.txt PRE-CREATION
trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_metabolome.txt PRE-CREATION
trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_microarray.txt PRE-CREATION
trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_proteome.txt PRE-CREATION
trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_transcriptome.txt PRE-CREATION
trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/i_investigation.txt PRE-CREATION
trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/s_BII-S-1.txt PRE-CREATION
trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/s_BII-S-2.txt PRE-CREATION

Diff: https://reviews.apache.org/r/32291/diff/

Testing
---
Tested on sample ISA-Tab files downloaded from http://www.isa-tools.org/format/examples/.

Thanks,
Giuseppe Totaro
Re: Review Request 32291: ISATab parsers (preliminary version)
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/32291/#review78158 ---

Ship it!

Ship It!

- Chris Mattmann
[jira] [Updated] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)
[ https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-1584:
- Priority: Blocker (was: Major)

Tika 1.7 possible regression (nested attachment files not getting parsed)

Key: TIKA-1584
URL: https://issues.apache.org/jira/browse/TIKA-1584
[jira] [Commented] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available
[ https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385561#comment-14385561 ]

Konstantin Gribov commented on TIKA-1575:

What about updating to the released PDFBox 1.8.9? Extracting from {{966679.pdf}} (PDFBOX-2261) hangs on trunk Tika with 1.8.8. And I see no difference when extracting {{10-814_Appendix B_v3.pdf}}, so the form extraction issue seems to be fixed. My question relates to the discussion about releasing Tika 1.8/1.7.1; see dev@ (https://mail-archives.apache.org/mod_mbox/tika-dev/201503.mbox/%3CCAM%3DrFA6vvFV3XqpvSNCrubrVHhVO%3Dq%2BighPMRUkmA9f-fKkSXA%40mail.gmail.com%3E)

Upgrade to PDFBox 1.8.9 when available

Key: TIKA-1575
URL: https://issues.apache.org/jira/browse/TIKA-1575
Project: Tika
Issue Type: Improvement
Reporter: Tim Allison
Priority: Minor
Attachments: 005937.pdf.json, 005937_1_8_9-SNAPSHOT.pdf.json, 10-814_Appendix B_v3.pdf, 524276_719128_diffs.zip, PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT.xlsx, PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT_reports.zip, PDFBox_1_8_8Vs1_8_9_20150316.zip, caught_ex_1_8_9.zip, content_diffs_20150316.xlsx, diffs_1_8_9_multithread_vs_single_thread.xlsx, reports_1_8_9_multithread_vs_single.zip

The PDFBox community is about to release 1.8.9. Let's use this issue to track discussions before the release and to track Tika's upgrade to PDFBox 1.8.9.
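For context, a PDFBox version bump of this kind is normally a one-line POM change. A sketch, assuming the dependency is declared directly in {{tika-parsers/pom.xml}} (the real POM may manage the version via a property or a parent POM instead):

```xml
<!-- Hypothetical fragment for tika-parsers/pom.xml: pin PDFBox to 1.8.9
     once released. The exact location and any version property in the
     actual Tika build may differ. -->
<dependency>
  <groupId>org.apache.pdfbox</groupId>
  <artifactId>pdfbox</artifactId>
  <version>1.8.9</version>
</dependency>
```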
[jira] [Commented] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)
[ https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385477#comment-14385477 ]

Tim Allison commented on TIKA-1584:

IMHO this is major enough for a fix asap. Whether that's 1.7.1 with just this fix or a full cut of trunk as 1.8 is up to all devs. Tika colleagues, what do you think?

Key: TIKA-1584
URL: https://issues.apache.org/jira/browse/TIKA-1584
Re: [DISCUSS] Tika 1.8 or 1.7.1
Also, I think we should resolve TIKA-1575 (upgrade to PDFBox 1.8.9), since PDFBox 1.8.8 hangs on some PDF forms.

--
Best regards,
Konstantin Gribov

On Sat, 28 Mar 2015 at 23:22, Konstantin Gribov gros...@gmail.com:

+1 to releasing 1.8.

--
Best regards,
Konstantin Gribov

On Sat, 28 Mar 2015, 22:25, Tyler Palsulich tpalsul...@apache.org:

I'm also leaning toward 1.8. Especially given the newly identified regression in TIKA-1584.

Tyler

On Mar 28, 2015 11:47 AM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote:

Hi Tyler - I would VOTE for 1.8. Given the stuff associated with releasing (updating the website; sending emails; waiting periods, etc.) let's ship all the updates we have too, along with the jhighlight fix.

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory, Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++

-----Original Message-----
From: Tyler Palsulich tpalsul...@apache.org
Reply-To: dev@tika.apache.org
Date: Saturday, March 28, 2015 at 8:01 AM
To: dev@tika.apache.org
Subject: [DISCUSS] Tika 1.8 or 1.7.1

Hi Folks,

Now that TIKA-1581 (JHighlight licensing issues) is resolved, we need to release a new version of Tika. I'll volunteer to be the release manager again. Should we release this as 1.8 or 1.7.1? Does anyone have any last minute issues they'd like to finish and see in Tika 1.X? I'd like to get the example working with CORS (TIKA-1585 and TIKA-1586). Any others?

Have a good weekend,
Tyler
Re: [DISCUSS] Tika 1.8 or 1.7.1
Once we fix TIKA-1584, I don't have a preference. I defer to Chris's experience (so I guess, +1 for 1.8) given the amount of work required. It'd be great if we could make sure we aren't bundling any PDFs in our tika-app jar, too. Many apologies if that's been fixed!
[jira] [Commented] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)
[ https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385494#comment-14385494 ]

Rob Tulloh commented on TIKA-1584:

I would vote for a release, as we have been waiting for TIKA-1371 and were hoping to upgrade to 1.7.

Key: TIKA-1584
URL: https://issues.apache.org/jira/browse/TIKA-1584
Re: [DISCUSS] Tika 1.8 or 1.7.1
I'm also leaning toward 1.8. Especially given the newly identified regression in TIKA-1584.

Tyler
[jira] [Commented] (TIKA-1580) ISA-Tab parsers
[ https://issues.apache.org/jira/browse/TIKA-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385541#comment-14385541 ]

Chris A. Mattmann commented on TIKA-1580:

Committed in r1669839. Thank you [~gostep], you did amazing work on this!

{noformat}
[chipotle:~/tmp/tika] mattmann% svn commit -m "Fix for TIKA-1580: Support IsaTab MIME identification and parsing. Thanks to Giuseppe Totaro for all the great work!"
Sending        CHANGES.txt
Sending        tika-bundle/pom.xml
Sending        tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
Sending        tika-parsers/pom.xml
Adding         tika-parsers/src/main/java/org/apache/tika/parser/isatab
Adding         tika-parsers/src/main/java/org/apache/tika/parser/isatab/ISATabUtils.java
Adding         tika-parsers/src/main/java/org/apache/tika/parser/isatab/ISArchiveParser.java
Sending        tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
Adding         tika-parsers/src/test/java/org/apache/tika/parser/isatab
Adding         tika-parsers/src/test/java/org/apache/tika/parser/isatab/ISArchiveParserTest.java
Adding         tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1
Adding         tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_bii-s-2_metabolite profiling_NMR spectroscopy.txt
Adding         tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_metabolome.txt
Adding         tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_microarray.txt
Adding         tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_proteome.txt
Adding         tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_transcriptome.txt
Adding         tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/i_investigation.txt
Adding         tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/s_BII-S-1.txt
Adding         tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/s_BII-S-2.txt
Transmitting file data
Committed revision 1669839.
[chipotle:~/tmp/tika] mattmann%
{noformat}

ISA-Tab parsers

Key: TIKA-1580
URL: https://issues.apache.org/jira/browse/TIKA-1580
Project: Tika
Issue Type: New Feature
Components: parser
Reporter: Giuseppe Totaro
Assignee: Chris A. Mattmann
Priority: Minor
Labels: new-parser
Fix For: 1.8
Attachments: TIKA-1580.Mattmann.Totaro.032515.patch.txt, TIKA-1580.patch, TIKA-1580.v02.patch, TIKA-1580.v03.2.Mattmann.Totaro.03262015.patch

We are going to add parsers for ISA-Tab data formats. ISA-Tab files are related to [ISA Tools|http://www.isa-tools.org/], which help to manage an increasingly diverse set of life science, environmental, and biomedical experiments that employ one or a combination of technologies. The ISA tools are built upon the _Investigation_, _Study_, and _Assay_ tabular formats. Therefore, the ISA-Tab data format includes three types of file: Investigation file ({{i_*.txt}}), Study file ({{s_*.txt}}), and Assay file ({{a_*.txt}}). These files are organized as a [top-down hierarchy|http://www.isa-tools.org/format/specification/]: an Investigation file includes one or more Study files; each Study file includes one or more Assay files. Essentially, the Investigation file contains high-level information about the related study, so it provides only metadata about the ISA-Tab files. More details on the file format specification are [available online|http://isatab.sourceforge.net/docs/ISA-TAB_release-candidate-1_v1.0_24nov08.pdf].

The patch in attachment provides a preliminary version of the ISA-Tab parsers (there are three parsers, one for each ISA-Tab filetype):
* {{ISATabInvestigationParser.java}}: parses Investigation files. It extracts only metadata.
* {{ISATabStudyParser.java}}: parses Study files.
* {{ISATabAssayParser.java}}: parses Assay files.

The most important improvements to make are:
* Combine these three parsers in order to parse an ISArchive.
* Provide a better mapping of both study and assay data onto XHTML.

Currently, {{ISATabStudyParser}} and {{ISATabAssayParser}} provide a naive mapping function relying on [Apache Commons CSV|https://commons.apache.org/proper/commons-csv/].

Thanks for supporting me on this work, [~chrismattmann].
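The naive Commons CSV mapping mentioned above can be sketched roughly as follows. This is not the actual parser code; `parseTable` is a hypothetical helper that simply reads a tab-delimited ISA-Tab study/assay table into rows of cell values, using the stock `CSVFormat.TDF` format:

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

public class IsaTabSketch {

    // Read a tab-delimited ISA-Tab table (study or assay file) into rows
    // of cell values; the first row is the column header row.
    static List<List<String>> parseTable(Reader reader) throws Exception {
        List<List<String>> rows = new ArrayList<>();
        try (CSVParser parser = CSVFormat.TDF.parse(reader)) {
            for (CSVRecord record : parser) {
                List<String> row = new ArrayList<>();
                for (String cell : record) {
                    row.add(cell);
                }
                rows.add(row);
            }
        }
        return rows;
    }

    public static void main(String[] args) throws Exception {
        // Tiny inline sample in the shape of a study table.
        String sample = "Source Name\tSample Name\ns1\tsample-1\n";
        for (List<String> row : parseTable(new StringReader(sample))) {
            System.out.println(row);
        }
    }
}
```

A real parser would then emit these rows as XHTML table elements through the ContentHandler rather than collecting them in memory.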
[jira] [Resolved] (TIKA-1580) ISA-Tab parsers
[ https://issues.apache.org/jira/browse/TIKA-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann resolved TIKA-1580.
- Resolution: Fixed

ISA-Tab parsers

Key: TIKA-1580
URL: https://issues.apache.org/jira/browse/TIKA-1580
[jira] [Comment Edited] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)
[ https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385463#comment-14385463 ]

Tim Allison edited comment on TIKA-1584 at 3/28/15 6:53 PM:

Just checked svn. That's a major regression added in 1.7 when we added specification of ParseContext. We need to add the Parser to the ParseContext to get recursive parsing. W/o use of ParseContext in call to parse, the parser used to work recursively. Will fix Monday unless someone beats me to it. Thank you for raising this. No need to attach test doc.

was (Author: talli...@mitre.org):

Just checked svn. That's a major regression added in 1.7 when we added specification of ParseContext. We need to add the Parser to the ParseContext to get recursive parsing. W/o use of ParseContext in call to parse, the parser works recursively. Will fix Monday unless someone beats me to it. Thank you for raising this. No need to attach test doc.

Key: TIKA-1584
URL: https://issues.apache.org/jira/browse/TIKA-1584
[jira] [Commented] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)
[ https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385463#comment-14385463 ]

Tim Allison commented on TIKA-1584:

Just checked svn. That's a major regression added in 1.7 when we added specification of ParseContext. We need to add the Parser to the ParseContext to get recursive parsing. W/o use of ParseContext in call to parse, the parser works recursively. Will fix Monday unless someone beats me to it. Thank you for raising this. No need to attach test doc.

Key: TIKA-1584
URL: https://issues.apache.org/jira/browse/TIKA-1584
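On the client side of the Tika API, the fix described above amounts to registering the parser with the ParseContext before calling parse. A minimal sketch against the standard Tika API; the helper name and the use of AutoDetectParser here are illustrative, not the actual tika-server code:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

public class RecursiveExtract {

    // Hypothetical helper: extract text, recursing into embedded documents
    // (e.g. a docx inside a zip attached to an eml).
    static String extractRecursively(InputStream stream) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        ParseContext context = new ParseContext();
        // Without this line, container formats stop at the outermost
        // embedded entry; registering the parser enables recursion.
        context.set(Parser.class, parser);
        BodyContentHandler handler = new BodyContentHandler(-1); // no write limit
        parser.parse(stream, handler, new Metadata(), context);
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        InputStream in = new ByteArrayInputStream(
                "hello world".getBytes(StandardCharsets.UTF_8));
        System.out.println(extractRecursively(in));
    }
}
```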
[jira] [Commented] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)
[ https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385483#comment-14385483 ]

Tyler Palsulich commented on TIKA-1584:

We now have two major issues which need a quick release. So, I would say go for 1.8. Tim, can you chime in on the current discuss thread?

Key: TIKA-1584
URL: https://issues.apache.org/jira/browse/TIKA-1584
[jira] [Commented] (TIKA-1580) ISA-Tab parsers
[ https://issues.apache.org/jira/browse/TIKA-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385558#comment-14385558 ] Hudson commented on TIKA-1580: SUCCESS: Integrated in tika-trunk-jdk1.7 #579 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/579/]) Fix for TIKA-1580: Support IsaTab MIME identification and parsing. Thanks to Giuseppe Totaro for all the great work! (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1669839)
* /tika/trunk/CHANGES.txt
* /tika/trunk/tika-bundle/pom.xml
* /tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
* /tika/trunk/tika-parsers/pom.xml
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/isatab
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/isatab/ISATabUtils.java
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/isatab/ISArchiveParser.java
* /tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/isatab
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/isatab/ISArchiveParserTest.java
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_bii-s-2_metabolite profiling_NMR spectroscopy.txt
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_metabolome.txt
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_microarray.txt
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_proteome.txt
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/a_transcriptome.txt
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/i_investigation.txt
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/s_BII-S-1.txt
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testISATab_BII-I-1/s_BII-S-2.txt

ISA-Tab parsers
Key: TIKA-1580 URL: https://issues.apache.org/jira/browse/TIKA-1580 Project: Tika Issue Type: New Feature Components: parser Reporter: Giuseppe Totaro Assignee: Chris A. Mattmann Priority: Minor Labels: new-parser Fix For: 1.8 Attachments: TIKA-1580.Mattmann.Totaro.032515.patch.txt, TIKA-1580.patch, TIKA-1580.v02.patch, TIKA-1580.v03.2.Mattmann.Totaro.03262015.patch

We are going to add parsers for ISA-Tab data formats. ISA-Tab files are related to [ISA Tools|http://www.isa-tools.org/], which help to manage an increasingly diverse set of life science, environmental, and biomedical experiments that employ one or a combination of technologies. The ISA tools are built upon the _Investigation_, _Study_, and _Assay_ tabular formats. Therefore, the ISA-Tab data format includes three types of file: the Investigation file ({{i_.txt}}), the Study file ({{s_.txt}}), and the Assay file ({{a_.txt}}). These files are organized as a [top-down hierarchy|http://www.isa-tools.org/format/specification/]: an Investigation file includes one or more Study files; each Study file includes one or more Assay files. Essentially, the Investigation file contains high-level information about the related study, so it provides only metadata about the ISA-Tab files. More details on the file format specification are [available online|http://isatab.sourceforge.net/docs/ISA-TAB_release-candidate-1_v1.0_24nov08.pdf]. The attached patch provides a preliminary version of the ISA-Tab parsers (there are three parsers, one for each ISA-Tab filetype):
* {{ISATabInvestigationParser.java}}: parses Investigation files. It extracts only metadata.
* {{ISATabStudyParser.java}}: parses Study files.
* {{ISATabAssayParser.java}}: parses Assay files.
The most important improvements are:
* Combine these three parsers in order to parse an ISArchive
* Provide a better mapping of both study and assay data onto XHTML.
Currently, {{ISATabStudyParser}} and {{ISATabAssayParser}} provide a naive mapping function relying on [Apache Commons CSV|https://commons.apache.org/proper/commons-csv/]. Thanks for supporting me on this work [~chrismattmann].
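For readers unfamiliar with the format, the naive tab-delimited mapping described above can be sketched in plain Java. This is a hypothetical stdlib-only helper (`IsaTabSketch` is a made-up name, and the column names are invented examples); the actual parsers rely on Apache Commons CSV configured with a tab delimiter.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the naive column mapping an ISA-Tab study/assay
// parser performs: first row is the header, every later row is a record keyed
// by that header. The real code uses Apache Commons CSV with a tab delimiter.
public class IsaTabSketch {

    static List<Map<String, String>> parse(String content) {
        String[] lines = content.split("\r?\n");
        // The header row names the columns of every subsequent record.
        String[] header = lines[0].split("\t", -1);
        List<Map<String, String>> records = new ArrayList<>();
        for (int i = 1; i < lines.length; i++) {
            String[] cells = lines[i].split("\t", -1);
            Map<String, String> row = new LinkedHashMap<>();
            for (int c = 0; c < header.length && c < cells.length; c++) {
                row.put(header[c], cells[c]);
            }
            records.add(row);
        }
        return records;
    }

    public static void main(String[] args) {
        // Invented study-style content, two records under two columns.
        String study = "Source Name\tSample Name\nculture1\tsample-E\nculture2\tsample-N";
        List<Map<String, String>> rows = parse(study);
        System.out.println(rows.size());                    // 2
        System.out.println(rows.get(0).get("Sample Name")); // sample-E
    }
}
```

Combining the three parsers into one ISArchive parser then amounts to walking the Investigation file and dispatching each referenced study/assay file through a mapping like this.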
[jira] [Commented] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)
[ https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385472#comment-14385472 ] Rob Tulloh commented on TIKA-1584: Thank you. For what it's worth, it is easy to reproduce. Just zip any document you want, pass the zip file to tika-server, and see what it gives back. As 1.7 is released, does this mean that this won't be fixed until 1.8, or would 1.7 get re-released/patched?
[jira] [Commented] (TIKA-1580) ISA-Tab parsers
[ https://issues.apache.org/jira/browse/TIKA-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385540#comment-14385540 ] Chris A. Mattmann commented on TIKA-1580: Built and unit tested successfully. Deployed in production on a bioinformatics project. Works great! Committing now.
{noformat}
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Apache Tika parent ................................. SUCCESS [  1.994 s]
[INFO] Apache Tika core ................................... SUCCESS [ 19.033 s]
[INFO] Apache Tika parsers ................................ SUCCESS [02:29 min]
[INFO] Apache Tika XMP .................................... SUCCESS [  2.675 s]
[INFO] Apache Tika serialization .......................... SUCCESS [  2.082 s]
[INFO] Apache Tika batch .................................. SUCCESS [01:57 min]
[INFO] Apache Tika application ............................ SUCCESS [ 13.227 s]
[INFO] Apache Tika OSGi bundle ............................ SUCCESS [ 18.103 s]
[INFO] Apache Tika server ................................. SUCCESS [ 22.457 s]
[INFO] Apache Tika translate .............................. SUCCESS [  3.347 s]
[INFO] Apache Tika examples ............................... SUCCESS [  6.103 s]
[INFO] Apache Tika Java-7 Components ...................... SUCCESS [  2.039 s]
[INFO] Apache Tika ........................................ SUCCESS [  0.030 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 05:58 min
[INFO] Finished at: 2015-03-28T14:45:20-07:00
[INFO] Final Memory: 103M/1592M
[INFO] ------------------------------------------------------------------------
[chipotle:~/tmp/tika] mattmann%
{noformat}
[jira] [Created] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)
Rob Tulloh created TIKA-1584: Summary: Tika 1.7 possible regression (nested attachment files not getting parsed) Key: TIKA-1584 URL: https://issues.apache.org/jira/browse/TIKA-1584 Project: Tika Issue Type: Bug Components: server Affects Versions: 1.7 Reporter: Rob Tulloh
[jira] [Created] (TIKA-1585) Create Example Website with Form Submission
Tyler Palsulich created TIKA-1585: Summary: Create Example Website with Form Submission Key: TIKA-1585 URL: https://issues.apache.org/jira/browse/TIKA-1585 Project: Tika Issue Type: New Feature Components: example, server Reporter: Tyler Palsulich Assignee: Tyler Palsulich It would be great to have a website where we can direct people who ask what Tika can do for [filetype], without needing them to actually download Tika. Some initial work toward that is [here|http://tpalsulich.github.io/TikaExamples/]. I'm far from a design guru, but I imagine the site as having a form where you can upload a file at the top, checkboxes for whether you want metadata, content, or both, and a submit button. The request should be sent with AJAX, and the result should populate a {{div}}. One issue with AJAX requests is that Tika Server doesn't currently allow Cross-Origin Resource Sharing (CORS). So, we either need to maintain a slightly updated tika-server, or update the server to allow configuration.
[jira] [Commented] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)
[ https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385328#comment-14385328 ] Rob Tulloh commented on TIKA-1584: If the .zip file is passed to tika, it shows the same behavior.
{noformat}
curl -X PUT -T sign.zip -H Content-Type:application/octet-stream http://localhost:9998/tika 2>/dev/null
sign.docx
{noformat}
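The behavior the reporter expects, descending into containers rather than only listing entry names, can be illustrated with a standard-library sketch. This is not Tika's embedded-document API; `ZipRecursionSketch` is a made-up name, and the point is only the recursion idea (a docx is itself a zip, so stopping at the outer archive loses its text).

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical sketch of container recursion: descend into each zip-like
// entry instead of stopping at the first level of the archive.
public class ZipRecursionSketch {

    // Collect entry names, recursing into entries that are themselves containers.
    static void collect(InputStream in, String name, List<String> found) throws IOException {
        if (!name.endsWith(".zip") && !name.endsWith(".docx")) {
            return; // a leaf entry: nothing further to unwind
        }
        ZipInputStream zis = new ZipInputStream(new BufferedInputStream(in));
        ZipEntry entry;
        while ((entry = zis.getNextEntry()) != null) {
            found.add(name + "/" + entry.getName());
            collect(zis, entry.getName(), found); // entry data may itself be a zip
        }
    }

    public static void main(String[] args) throws IOException {
        // Build inner.zip containing doc.txt, then outer.zip containing inner.zip.
        ByteArrayOutputStream inner = new ByteArrayOutputStream();
        try (ZipOutputStream z = new ZipOutputStream(inner)) {
            z.putNextEntry(new ZipEntry("doc.txt"));
            z.write("hello".getBytes("UTF-8"));
        }
        ByteArrayOutputStream outer = new ByteArrayOutputStream();
        try (ZipOutputStream z = new ZipOutputStream(outer)) {
            z.putNextEntry(new ZipEntry("inner.zip"));
            z.write(inner.toByteArray());
        }
        List<String> found = new ArrayList<>();
        collect(new ByteArrayInputStream(outer.toByteArray()), "outer.zip", found);
        System.out.println(found); // [outer.zip/inner.zip, inner.zip/doc.txt]
    }
}
```

A 1.5-style result corresponds to the recursive call being made; the 1.7 behavior reported here corresponds to recording only the first level.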
[jira] [Commented] (TIKA-1581) jhighlight license concerns
[ https://issues.apache.org/jira/browse/TIKA-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385331#comment-14385331 ] Ken Krugler commented on TIKA-1581: Hi Tyler, JHighlight has been updated in Central, and Tika is now using that version. So I believe it's resolved, as long as the changes that Hong-Thai made to the NOTICE.txt are sufficient for the CDDL license used by jhighlight. And yes, as per my comment above, we'll need to release a new version of Tika for downstream libraries. Seems like it could be worth a quick dot release for ManifoldCF/Lucene.

jhighlight license concerns
Key: TIKA-1581 URL: https://issues.apache.org/jira/browse/TIKA-1581 Project: Tika Issue Type: Bug Affects Versions: 1.7 Reporter: Karl Wright Fix For: 1.8

jhighlight jar is a Tika dependency. The Lucene team discovered that, while it claims to be a CDDL/LGPL dual-license, some of its functionality is LGPL only:
{code}
Solr's contrib/extraction contains jhighlight-1.0.jar which declares itself as dual CDDL or LGPL license. However, some of its classes are distributed only under LGPL, e.g. com.uwyn.jhighlight.highlighter.
  CppHighlighter.java
  GroovyHighlighter.java
  JavaHighlighter.java
  XmlHighlighter.java
I downloaded the sources from Maven (http://search.maven.org/remotecontent?filepath=com/uwyn/jhighlight/1.0/jhighlight-1.0-sources.jar) to confirm that, and also found this SVN repo: http://svn.rifers.org/jhighlight/tags/release-1.0, though the project's website seems to not exist anymore (https://jhighlight.dev.java.net/). I didn't find any direct usage of it in our code, so I guess it's probably needed by a 3rd party dependency, such as Tika. Therefore if we e.g. omit it, things will compile, but may fail at runtime.
{code}
Is it possible to remove this dependency for future releases, or allow only optional inclusion of this package? It is of concern to the ManifoldCF project because we distribute a binary package that includes Tika and its required dependencies, which currently includes jHighlight.
[jira] [Commented] (TIKA-1581) jhighlight license concerns
[ https://issues.apache.org/jira/browse/TIKA-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385346#comment-14385346 ] Ken Krugler commented on TIKA-1581: Based on what I see in other projects (e.g. the Lucene NOTICE.txt file), this seems to be following standard practices, so I'm going to assume it's OK.
[jira] [Resolved] (TIKA-1526) ExternalParser should trap/ignore/workarround JDK-8047340 JDK-8055301 so Turkish Tika users can still use non-external parsers
[ https://issues.apache.org/jira/browse/TIKA-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich resolved TIKA-1526. Resolution: Fixed Marking this as Fixed, per the above comments. [~thetaphi] or [~hossman] or anyone else, please reopen this if you find any other cases. Thank you everyone for the help!

ExternalParser should trap/ignore/workarround JDK-8047340 JDK-8055301 so Turkish Tika users can still use non-external parsers
Key: TIKA-1526 URL: https://issues.apache.org/jira/browse/TIKA-1526 Project: Tika Issue Type: Wish Reporter: Hoss Man

The JDK has numerous pain points regarding the Turkish locale, posix_spawn lowercasing being one of them... https://bugs.openjdk.java.net/browse/JDK-8047340 https://bugs.openjdk.java.net/browse/JDK-8055301 As of Tika 1.7, the TesseractOCRParser (which is an ExternalParser) is enabled and configured by default in Tika, and uses ExternalParser.check to see if tesseract is available -- but because of the JDK bug, this means that Tika fails fast for Turkish users on BSD/UNIX variants (including MacOSX), like so...
{noformat}
[junit4] Throwable #1: java.lang.Error: posix_spawn is not a supported process launch mechanism on this platform.
[junit4]   at java.lang.UNIXProcess$1.run(UNIXProcess.java:105)
[junit4]   at java.lang.UNIXProcess$1.run(UNIXProcess.java:94)
[junit4]   at java.security.AccessController.doPrivileged(Native Method)
[junit4]   at java.lang.UNIXProcess.<clinit>(UNIXProcess.java:92)
[junit4]   at java.lang.ProcessImpl.start(ProcessImpl.java:130)
[junit4]   at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
[junit4]   at java.lang.Runtime.exec(Runtime.java:620)
[junit4]   at java.lang.Runtime.exec(Runtime.java:485)
[junit4]   at org.apache.tika.parser.external.ExternalParser.check(ExternalParser.java:344)
[junit4]   at org.apache.tika.parser.ocr.TesseractOCRParser.hasTesseract(TesseractOCRParser.java:117)
[junit4]   at org.apache.tika.parser.ocr.TesseractOCRParser.getSupportedTypes(TesseractOCRParser.java:90)
[junit4]   at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
[junit4]   at org.apache.tika.parser.DefaultParser.getParsers(DefaultParser.java:95)
[junit4]   at org.apache.tika.parser.CompositeParser.getSupportedTypes(CompositeParser.java:229)
[junit4]   at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
[junit4]   at org.apache.tika.parser.CompositeParser.getParser(CompositeParser.java:209)
[junit4]   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
[junit4]   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
{noformat}
...unless they go out of their way to whitelist only the parsers they need/want, so TesseractOCRParser (and any other ExternalParsers) will never even be check()ed. It would be nice if Tika's ExternalParser class added a similar hack/workaround to what was done in SOLR-6387 to trap these types of errors. In Solr we just propagate a better error explaining why Java hates the Turkish language...
{code}
} catch (Error err) {
  if (err.getMessage() != null &&
      (err.getMessage().contains("posix_spawn") || err.getMessage().contains("UNIXProcess"))) {
    log.warn("Error forking command due to JVM locale bug (see https://issues.apache.org/jira/browse/SOLR-6387): " + err.getMessage());
    return "(error executing: " + cmd + ")";
  }
}
{code}
...but with Tika, it might be better for all ExternalParsers to just opt out as if they don't recognize the filetype when they detect this type of error from the check method (or perhaps it would be better if AutoDetectParser handled this? ... I'm not really sure how it would best fit into Tika's architecture)
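The underlying JDK pitfall is easy to reproduce with a few lines of standalone Java (this is a demo of the locale behavior the linked JDK bugs describe, not Tika code): under the Turkish locale, lowercasing "I" yields dotless "ı" (U+0131), which breaks any internal ASCII-only name mangling such as "POSIX_SPAWN" → "posix_spawn".

```java
import java.util.Locale;

// Standalone demo of the locale pitfall behind JDK-8047340: Turkish case
// mapping turns 'I' into dotless 'ı' (U+0131) rather than 'i'.
public class TurkishLocaleDemo {
    public static void main(String[] args) {
        String s = "POSIX_SPAWN";
        // Locale-insensitive lowercasing gives the expected ASCII result.
        System.out.println(s.toLowerCase(Locale.ROOT));            // posix_spawn
        // Turkish lowercasing produces a different string entirely.
        System.out.println(s.toLowerCase(new Locale("tr", "TR"))); // posıx_spawn
    }
}
```

This is why code that must compare or mangle identifiers should use {{Locale.ROOT}} (or {{toLowerCase(Locale.ROOT)}}) rather than the default locale.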
[jira] [Commented] (TIKA-1581) jhighlight license concerns
[ https://issues.apache.org/jira/browse/TIKA-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385337#comment-14385337 ] Tyler Palsulich commented on TIKA-1581: Hi [~kkrugler]. Thanks. The comment is now bq. Tika-parsers component uses CDDL/LGPL dual-licensed dependency: jhighlight (https://github.com/codelibs/jhighlight) If this looks good, I'll start a \[DISCUSS\] thread on the list about a new version.
[jira] [Created] (TIKA-1586) Enable CORS on Tika Server
Tyler Palsulich created TIKA-1586: Summary: Enable CORS on Tika Server Key: TIKA-1586 URL: https://issues.apache.org/jira/browse/TIKA-1586 Project: Tika Issue Type: New Feature Components: server Reporter: Tyler Palsulich Assignee: Tyler Palsulich Tika Server should allow configuration of CORS requests (for uses like TIKA-1585). See [this example|http://cxf.apache.org/docs/jax-rs-cors.html] from CXF for how to add it. The only change from that site is that we will need to add a {{CrossOriginResourceSharingFilter}} as a provider. Ideally, this is configurable (limit which resources have CORS, and which origins are allowed). But I'm not thinking of any general methods of how to do that...
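What a CORS filter contributes at the HTTP level can be sketched with only the JDK's built-in HTTP server. This is an illustration of the header mechanics, not tika-server code (the class name `CorsSketch` and the `/tika` context here are made up); the real change registers CXF's {{CrossOriginResourceSharingFilter}} as a JAX-RS provider instead of setting headers by hand.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;

// Stdlib-only sketch: the Access-Control-Allow-Origin response header is what
// lets a browser page served from another origin read the response body.
public class CorsSketch {

    static HttpServer start() throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0); // port 0 = pick a free port
        server.createContext("/tika", exchange -> {
            // The one header a permissive CORS filter adds to every response.
            exchange.getResponseHeaders().add("Access-Control-Allow-Origin", "*");
            byte[] body = "ok".getBytes("UTF-8");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws Exception {
        HttpServer server = start();
        int port = server.getAddress().getPort();
        HttpURLConnection conn =
                (HttpURLConnection) new URL("http://localhost:" + port + "/tika").openConnection();
        System.out.println(conn.getHeaderField("Access-Control-Allow-Origin")); // *
        server.stop(0);
    }
}
```

Making the allowed origins and resources configurable, as the issue asks, would mean parameterizing the "*" value rather than hard-coding it.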
[GitHub] tika pull request: TIKA-1586. Enable CORS requests on Tika server
Github user asfgit closed the pull request at: https://github.com/apache/tika/pull/37 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (TIKA-1586) Enable CORS on Tika Server
[ https://issues.apache.org/jira/browse/TIKA-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385398#comment-14385398 ] ASF GitHub Bot commented on TIKA-1586: Github user asfgit closed the pull request at: https://github.com/apache/tika/pull/37
[jira] [Resolved] (TIKA-1586) Enable CORS on Tika Server
[ https://issues.apache.org/jira/browse/TIKA-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich resolved TIKA-1586. Resolution: Fixed Fixed in r1669799.
[jira] [Commented] (TIKA-1354) ForkParser doesn't work in OSGI container
[ https://issues.apache.org/jira/browse/TIKA-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385403#comment-14385403 ] Chris A. Mattmann commented on TIKA-1354: thanks! ForkParser doesn't work in OSGI container Key: TIKA-1354 URL: https://issues.apache.org/jira/browse/TIKA-1354 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.6 Reporter: Michal Hlavac Fix For: 1.7 I can't find a way to run ForkParser in an OSGI container.
[jira] [Commented] (TIKA-1577) NetCDF Data Extraction
[ https://issues.apache.org/jira/browse/TIKA-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385408#comment-14385408 ] Chris A. Mattmann commented on TIKA-1577: Agreed, if we can reuse this, then great. The one catch is that I'm not sure that dump capability generates a table or something in an XHTML representation, which is our basis representation in Tika. I would like us to consider the output of this issue to be:
- TikaParser generates XHTML tabular and other elements that represent the data in the NetCDF file
- we create something like a ScientificContentHandler that can then take that output from the parser (in the data section) and then format it, e.g., like NCDump.
Sound good?

NetCDF Data Extraction
Key: TIKA-1577 URL: https://issues.apache.org/jira/browse/TIKA-1577 Project: Tika Issue Type: Improvement Components: handler, parser Affects Versions: 1.7 Reporter: Ann Burgess Assignee: Ann Burgess Labels: features, handler Fix For: 1.8 Original Estimate: 504h Remaining Estimate: 504h

A netCDF classic or 64-bit offset dataset is stored as a single file comprising two parts:
- a header, containing all the information about dimensions, attributes, and variables except for the variable data;
- a data part, comprising fixed-size data, containing the data for variables that don't have an unlimited dimension, and variable-size data, containing the data for variables that have an unlimited dimension.
The NetCDFParser currently extracts the header part:
- text extracts file Dimensions and Variables
- metadata extracts Global Attributes
We want the option to extract the data part of NetCDF files. Let's use the NetCDF test file for our dev testing: tika/tika-parsers/src/test/resources/test-documents/sresa1b_ncar_ccsm3_0_run1_21.nc
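The split proposed in the comment above, where the parser emits the data section as XHTML tabular elements that a downstream handler can re-render (e.g. NCDump-style), can be sketched with the JDK's StAX writer. Everything here is hypothetical: `NetcdfXhtmlSketch` is a made-up name, the variable and values are invented, and the real work would live in the NetCDF parser and a content handler.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical sketch: render one variable's data values as XHTML table rows,
// the kind of intermediate representation a ScientificContentHandler could
// later re-format (e.g. into ncdump-style text).
public class NetcdfXhtmlSketch {

    static String toXhtmlTable(String variable, double[] values) throws Exception {
        StringWriter out = new StringWriter();
        XMLStreamWriter w = XMLOutputFactory.newFactory().createXMLStreamWriter(out);
        w.writeStartElement("table");
        for (double v : values) {
            w.writeStartElement("tr");
            w.writeStartElement("td");
            w.writeCharacters(variable);          // variable name column
            w.writeEndElement();
            w.writeStartElement("td");
            w.writeCharacters(Double.toString(v)); // data value column
            w.writeEndElement();
            w.writeEndElement();
        }
        w.writeEndElement();
        w.flush();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Invented sample values; a real parser would read them from the data part.
        System.out.println(toXhtmlTable("tas", new double[]{215.8, 215.7}));
    }
}
```

Using a streaming writer keeps the parser side simple, and a SAX/StAX handler on the other end can consume the same element stream without buffering the whole (potentially large) data part.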
[jira] [Commented] (TIKA-1586) Enable CORS on Tika Server
[ https://issues.apache.org/jira/browse/TIKA-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385407#comment-14385407 ] Hudson commented on TIKA-1586: SUCCESS: Integrated in tika-trunk-jdk1.7 #578 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/578/]) TIKA-1586. Enable CORS requests on Tika server. This fixes #37. (tpalsulich: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1669799)
* /tika/trunk/CHANGES.txt
* /tika/trunk/tika-server/pom.xml
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaServerCli.java
Re: trunk test failure
Thanks Oleg! ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: Oleg Tikhonov olegtikho...@gmail.com Reply-To: dev@tika.apache.org dev@tika.apache.org Date: Thursday, March 26, 2015 at 12:19 AM To: dev@tika.apache.org dev@tika.apache.org Subject: Re: trunk test failure Hi Chris, just to confirm:
[INFO] Reactor Summary:
[INFO] Apache Tika parent ............................... SUCCESS [  9.268 s]
[INFO] Apache Tika core ................................. SUCCESS [ 25.823 s]
[INFO] Apache Tika parsers .............................. SUCCESS [02:41 min]
[INFO] Apache Tika XMP .................................. SUCCESS [  1.986 s]
[INFO] Apache Tika serialization ........................ SUCCESS [  1.604 s]
[INFO] Apache Tika batch ................................ SUCCESS [02:02 min]
[INFO] Apache Tika application .......................... SUCCESS [ 18.983 s]
[INFO] Apache Tika OSGi bundle .......................... SUCCESS [ 29.087 s]
[INFO] Apache Tika server ............................... SUCCESS [ 46.706 s]
[INFO] Apache Tika translate ............................ SUCCESS [  9.163 s]
[INFO] Apache Tika examples ............................. SUCCESS [  4.134 s]
[INFO] Apache Tika Java-7 Components .................... SUCCESS [  1.236 s]
[INFO] Apache Tika ...................................... SUCCESS [  0.017 s]
[INFO] BUILD SUCCESS
[INFO] Total time: 07:20 min
[INFO] Finished at: 2015-03-26T09:18:46+02:00
[INFO] Final Memory: 91M/848M
BR, OLeg On Thu, Mar 26, 2015 at 1:21 AM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: OK I am nuts - I was applying the patch from TIKA-1580, but didn't update Felix in the bundle pom - done now, building again. Yay. ++ Chris Mattmann, Ph.D. 
Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: Mattmann, Chris Mattmann chris.a.mattm...@jpl.nasa.gov Reply-To: dev@tika.apache.org dev@tika.apache.org Date: Wednesday, March 25, 2015 at 6:57 PM To: dev@tika.apache.org dev@tika.apache.org Subject: trunk test failure Hey Anyone else seeing this failure in trunk?
Running org.apache.tika.bundle.BundleIT
[main] INFO org.ops4j.pax.exam.spi.DefaultExamSystem - Pax Exam System (Version: 4.4.0) created.
[main] INFO org.ops4j.pax.exam.junit.impl.ProbeRunner - creating PaxExam runner for class org.apache.tika.bundle.BundleIT
[main] INFO org.ops4j.pax.exam.junit.impl.ProbeRunner - running test class org.apache.tika.bundle.BundleIT
ERROR: Bundle org.apache.tika.bundle [17] Error starting file:/Users/mattmann/tmp/tika/tika-bundle/target/test-bundles/tika-bundle.jar (org.osgi.framework.BundleException: Unresolved constraint in bundle org.apache.tika.bundle [17]: Unable to resolve 17.0: missing requirement [17.0] osgi.wiring.package; (&(osgi.wiring.package=org.apache.commons.csv)(version>=1.0.0)(!(version>=2.0.0))))
org.osgi.framework.BundleException: Unresolved constraint in bundle org.apache.tika.bundle [17]: Unable to resolve 17.0: missing requirement [17.0] osgi.wiring.package; (&(osgi.wiring.package=org.apache.commons.csv)(version>=1.0.0)(!(version>=2.0.0)))
at org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:4097)
at org.apache.felix.framework.Felix.startBundle(Felix.java:2114)
at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1368)
at org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:308)
at java.lang.Thread.run(Thread.java:745)
[main] ERROR
[jira] [Commented] (TIKA-1585) Create Example Website with Form Submission
[ https://issues.apache.org/jira/browse/TIKA-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385411#comment-14385411 ] Tyler Palsulich commented on TIKA-1585: --- CORS work is now integrated. [~talli...@mitre.org], can you restart the server on 162.242.228.174:9998 with the --cors "http://tpalsulich.github.io" option? Then, we can close off the 9997 port (my github.io site is querying 9997, though, so I'll need to update that). Is there an official place we'd like to host the above site? Create Example Website with Form Submission --- Key: TIKA-1585 URL: https://issues.apache.org/jira/browse/TIKA-1585 Project: Tika Issue Type: New Feature Components: example, server Reporter: Tyler Palsulich Assignee: Tyler Palsulich It would be great to have a website where we can direct people who ask what Tika can do for [filetype] without needing them to actually download Tika. Some initial work to do that is [here|http://tpalsulich.github.io/TikaExamples/]. I'm far from a design guru, but I imagine the site as having a form where you can upload a file at the top, checkboxes for whether you want metadata, content, or both, and a submit button. The request should be sent with AJAX, and the result should populate a {{div}}. One issue with AJAX requests is that Tika Server doesn't currently allow Cross-Origin Resource Sharing (CORS). So, we either need to maintain a slightly updated tika-server, or update the server to allow configuration. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
RE: [DISCUSS] Tika 1.8 or 1.7.1
Given how recently we did a 1.7 release, my vote would be for 1.7.1. And to keep this release as simple as possible, just cherry-pick the fix for TIKA-1581 into the 1.7 code base. -- Ken From: Tyler Palsulich Sent: March 28, 2015 8:01:03am PDT To: dev@tika.apache.org Subject: [DISCUSS] Tika 1.8 or 1.7.1 Hi Folks, Now that TIKA-1581 (JHighlight licensing issues) is resolved, we need to release a new version of Tika. I'll volunteer to be the release manager again. Should we release this as 1.8 or 1.7.1? Does anyone have any last minute issues they'd like to finish and see in Tika 1.X? I'd like to get the example working with CORS (TIKA-1585 and TIKA-1586). Any others? Have a good weekend, Tyler -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr
[jira] [Commented] (TIKA-1579) Add file type to NetCDFParser
[ https://issues.apache.org/jira/browse/TIKA-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385383#comment-14385383 ] Ann Burgess commented on TIKA-1579: --- Yes! On Sat, Mar 28, 2015 at 6:09 AM, Tyler Palsulich (JIRA) j...@apache.org -- -- Ann Bryant Burgess, PhD Postdoctoral Fellow Computer Science Department Viterbi School of Engineering University of Southern California Phone: (585) 738-7549 -- Add file type to NetCDFParser - Key: TIKA-1579 URL: https://issues.apache.org/jira/browse/TIKA-1579 Project: Tika Issue Type: Improvement Components: parser Reporter: Ann Burgess Assignee: Ann Burgess Attachments: TIKA-1579.abburgess.190315.patch.txt [~gostep] explains that there are three versions of NetCDF (classic format, 64-bit offset, and netCDF-4/HDF5 format). When opening an existing netCDF file, the netCDF library transparently detects its format, so we do not need to adjust according to the detected format. That said, it would be good to know the file type, as each version can have the .nc extension. This patch will add the file type to the metadata. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
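The three on-disk formats mentioned above can be told apart by their leading magic bytes: classic files begin with "CDF" followed by 0x01, 64-bit offset files with "CDF" followed by 0x02, and netCDF-4 files carry the HDF5 signature. A minimal, self-contained sketch of such a check (the class name and returned labels are made up for illustration, not Tika's API):

```java
import java.util.Arrays;

public class NetcdfFormatSniffer {
    // Distinguish the three NetCDF on-disk formats by magic number:
    //   "CDF" 0x01              -> classic
    //   "CDF" 0x02              -> 64-bit offset
    //   0x89 "HDF" \r \n 0x1a \n -> netCDF-4 (HDF5 container)
    static String detect(byte[] header) {
        byte[] hdf5 = {(byte) 0x89, 'H', 'D', 'F', '\r', '\n', 0x1a, '\n'};
        if (header.length >= 8 && Arrays.equals(Arrays.copyOf(header, 8), hdf5)) {
            return "netCDF-4/HDF5";
        }
        if (header.length >= 4
                && header[0] == 'C' && header[1] == 'D' && header[2] == 'F') {
            if (header[3] == 1) return "classic";
            if (header[3] == 2) return "64-bit offset";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        System.out.println(detect(new byte[]{'C', 'D', 'F', 1}));
    }
}
```

Since all three formats share the .nc extension, a magic-number check like this is the only reliable way to record the exact format in the metadata.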
[DISCUSS] Tika 1.8 or 1.7.1
Hi Folks, Now that TIKA-1581 (JHighlight licensing issues) is resolved, we need to release a new version of Tika. I'll volunteer to be the release manager again. Should we release this as 1.8 or 1.7.1? Does anyone have any last minute issues they'd like to finish and see in Tika 1.X? I'd like to get the example working with CORS (TIKA-1585 and TIKA-1586). Any others? Have a good weekend, Tyler
[GitHub] tika pull request: TIKA-1586. Enable CORS requests on Tika server
GitHub user tpalsulich opened a pull request: https://github.com/apache/tika/pull/37 TIKA-1586. Enable CORS requests on Tika server You can merge this pull request into a Git repository by running: $ git pull https://github.com/tpalsulich/tika TIKA-1586 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/37.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #37 commit c296a0459477f1ba088d96fa6ba3895e3a6b3ac5 Author: Tyler Palsulich tpalsul...@gmail.com Date: 2015-03-28T15:45:45Z TIKA-1586. Enable CORS requests on Tika server. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
Re: [DISCUSS] Tika 1.8 or 1.7.1
Hi Tyler - I would VOTE for 1.8. Given the stuff associated with releasing (updating the website; sending emails; waiting periods, etc.) let’s ship all the updates we have too along with the jhighlight fix. Cheers, Chris ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: Tyler Palsulich tpalsul...@apache.org Reply-To: dev@tika.apache.org dev@tika.apache.org Date: Saturday, March 28, 2015 at 8:01 AM To: dev@tika.apache.org dev@tika.apache.org Subject: [DISCUSS] Tika 1.8 or 1.7.1 Hi Folks, Now that TIKA-1581 (JHighlight licensing issues) is resolved, we need to release a new version of Tika. I'll volunteer to be the release manager again. Should we release this as 1.8 or 1.7.1? Does anyone have any last minute issues they'd like to finish and see in Tika 1.X? I'd like to get the example working with CORS (TIKA-1585 and TIKA-1586). Any others? Have a good weekend, Tyler
[jira] [Commented] (TIKA-1586) Enable CORS on Tika Server
[ https://issues.apache.org/jira/browse/TIKA-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385369#comment-14385369 ] ASF GitHub Bot commented on TIKA-1586: -- GitHub user tpalsulich opened a pull request: https://github.com/apache/tika/pull/37 TIKA-1586. Enable CORS requests on Tika server You can merge this pull request into a Git repository by running: $ git pull https://github.com/tpalsulich/tika TIKA-1586 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/37.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #37 commit c296a0459477f1ba088d96fa6ba3895e3a6b3ac5 Author: Tyler Palsulich tpalsul...@gmail.com Date: 2015-03-28T15:45:45Z TIKA-1586. Enable CORS requests on Tika server. Enable CORS on Tika Server -- Key: TIKA-1586 URL: https://issues.apache.org/jira/browse/TIKA-1586 Project: Tika Issue Type: New Feature Components: server Reporter: Tyler Palsulich Assignee: Tyler Palsulich -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1586) Enable CORS on Tika Server
[ https://issues.apache.org/jira/browse/TIKA-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385372#comment-14385372 ] Tyler Palsulich commented on TIKA-1586: --- Can someone take a look at the above PR and make sure I'm not doing anything bone-headed? Thanks! Enable CORS on Tika Server -- Key: TIKA-1586 URL: https://issues.apache.org/jira/browse/TIKA-1586 Project: Tika Issue Type: New Feature Components: server Reporter: Tyler Palsulich Assignee: Tyler Palsulich -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)
[ https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385440#comment-14385440 ] Tim Allison commented on TIKA-1584: --- Able to attach example triggering doc? By same behavior do you mean that a docx inside a zip is not extracted with -X? Tika 1.7 possible regression (nested attachment files not getting parsed) - Key: TIKA-1584 URL: https://issues.apache.org/jira/browse/TIKA-1584 Project: Tika Issue Type: Bug Components: server Affects Versions: 1.7 Reporter: Rob Tulloh -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)
[ https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385440#comment-14385440 ] Tim Allison edited comment on TIKA-1584 at 3/28/15 6:08 PM: Able to attach example triggering doc? By same behavior do you mean that a docx inside a zip is not extracted with /tika? Any luck w /rmeta? was (Author: talli...@mitre.org): Able to attach example triggering doc? By same behavior do you mean that a docx inside a zip is not extracted with -X? Tika 1.7 possible regression (nested attachment files not getting parsed) - Key: TIKA-1584 URL: https://issues.apache.org/jira/browse/TIKA-1584 Project: Tika Issue Type: Bug Components: server Affects Versions: 1.7 Reporter: Rob Tulloh -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1582) Mime Detection based on neural networks with Byte-frequency-histogram
[ https://issues.apache.org/jira/browse/TIKA-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke sh updated TIKA-1582: -- Attachment: nnmodel.docx Documentation Mime Detection based on neural networks with Byte-frequency-histogram -- Key: TIKA-1582 URL: https://issues.apache.org/jira/browse/TIKA-1582 Project: Tika Issue Type: Improvement Components: detector, mime Affects Versions: 1.7 Reporter: Luke sh Priority: Trivial Attachments: nnmodel.docx Content-based MIME type detection is one of the popular approaches to detecting MIME types; others are based on file extensions and magic numbers. Tika currently implements three detection approaches: 1) file extensions, 2) magic numbers (the most trustworthy in Tika), and 3) the Content-Type header of the HTTP response, if present. Content-based detection, by contrast, analyses the distribution of the entire stream of bytes, finds a similar pattern for files of the same type, and builds a function that groups them into one or more classes for classification and prediction. This feature could broaden the usage of Tika and add a measure of security to MIME type detection: because the model is etched with the patterns it has seen, we can choose not to trust types that it has not been trained on. Magic numbers embedded in a file can be copied, while the actual content could be a harmful Trojan program; by trusting byte-frequency patterns instead, we can harden detection. The proposed content-based detection to be integrated into Tika is based on a machine learning algorithm: a neural network trained with back-propagation. 
The input is 256 bins, one per byte value (0-255), each storing the count of occurrences of that byte. The byte-frequency histograms are normalized to fall in the range between 0 and 1, then passed through a companding function to enhance the infrequent bytes. The output of the neural network is a binary decision, 1 or 0. The proposed feature will be implemented with the GRB file type as one example: we build a model that classifies GRB files against non-GRB files. Note that the set of non-GRB files is huge and cannot be easily defined, so there need to be as many negative training examples as possible to form the non-GRB decision boundary. Neural networks involve two stages of processing: training and classification. Training can be done in any programming language; in this feature/research, the training of the neural network is implemented in R, and the source can be found in my GitHub repository, https://github.com/LukeLiush/filetypeDetection; I am also going to post a document that describes the use of the program and the syntax/format of the input and output. After training, we export the model and import it into Tika; in Tika, we create a TrainedModelDetector that reads this model file (with one or more model parameters) or several model files, so it can detect the MIME types covered by those models. Details of the research and usage of this proposed feature will be posted on my GitHub shortly. It is worth noting again that in this research we only worked out one model, GRB, as one example to demonstrate the use of this content-based MIME detection. 
One of the challenges, again, is that the non-GRB file types cannot be clearly defined unless we feed our model example data for all existing file types in the world, which seems too utopian; so it is better that the set of classes/types is given and defined in advance to minimize the problem domain. Another challenge is the size of the training data: even if we know the types we want to classify, getting enough training data to form a model is one of the main factors of success. In our example model, GRB data were collected from ftp://hydro1.sci.gsfc.nasa.gov/data/, and we found that the GRB data from that source all exhibit a similar pattern; a simple neural network structure is able to predict well, and even a linear logistic regression does a good job. However, if we pass GRB files collected from other sources to the model for prediction, the model predicts poorly and unexpectedly, which brings up the question of whether we need to include all training data or only the data of interest; including all data is very expensive, so it is necessary to introduce some domain knowledge to
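The feature-extraction step described in the ticket (a 256-bin byte histogram, normalized into [0, 1], then passed through a companding function to boost infrequent bytes) can be sketched as follows. The square-root companding exponent and the max-count normalization are assumptions for illustration; the ticket does not specify the exact functions used in the R training code:

```java
public class ByteHistogram {
    // Build a 256-bin byte-frequency histogram, normalize each bin into
    // [0, 1] (here by the most frequent byte's count, one plausible reading
    // of the ticket), then apply an assumed x^0.5 companding function so
    // that infrequent bytes contribute more to the feature vector.
    static double[] features(byte[] data) {
        double[] bins = new double[256];
        for (byte b : data) {
            bins[b & 0xFF]++;   // & 0xFF maps signed bytes to 0..255
        }
        double max = 0;
        for (double c : bins) {
            max = Math.max(max, c);
        }
        if (max == 0) {
            return bins;        // empty input: all-zero feature vector
        }
        for (int i = 0; i < 256; i++) {
            bins[i] = Math.pow(bins[i] / max, 0.5);
        }
        return bins;
    }

    public static void main(String[] args) {
        double[] f = features(new byte[]{0, 0, 1});
        System.out.println(f[0] + " " + f[1]);
    }
}
```

The resulting 256-element vector in [0, 1] is the kind of input a small feed-forward network (or logistic regression, as the ticket notes) can consume directly.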
[jira] [Updated] (TIKA-1582) Mime Detection based on neural networks with Byte-frequency-histogram
[ https://issues.apache.org/jira/browse/TIKA-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke sh updated TIKA-1582: -- Attachment: week2-report-histogram comparison.docx histogram comparison Mime Detection based on neural networks with Byte-frequency-histogram -- Key: TIKA-1582 URL: https://issues.apache.org/jira/browse/TIKA-1582 Project: Tika Issue Type: Improvement Components: detector, mime Affects Versions: 1.7 Reporter: Luke sh Priority: Trivial Attachments: nnmodel.docx, week2-report-histogram comparison.docx, week6 report.docx
[jira] [Updated] (TIKA-1582) Mime Detection based on neural networks with Byte-frequency-histogram
[ https://issues.apache.org/jira/browse/TIKA-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke sh updated TIKA-1582: -- Attachment: week6 report.docx Test report Mime Detection based on neural networks with Byte-frequency-histogram -- Key: TIKA-1582 URL: https://issues.apache.org/jira/browse/TIKA-1582 Project: Tika Issue Type: Improvement Components: detector, mime Affects Versions: 1.7 Reporter: Luke sh Priority: Trivial Attachments: nnmodel.docx, week6 report.docx
[GitHub] tika pull request: Nn branch
GitHub user LukeLiush opened a pull request: https://github.com/apache/tika/pull/36 Nn branch https://issues.apache.org/jira/browse/TIKA-1582 You can merge this pull request into a Git repository by running: $ git pull https://github.com/LukeLiush/tika nnBranch Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/36.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #36 commit eb04f13260bfb5e4f4b0bf7fd54ecd085995cb92 Author: LukeLiush hanson311...@gmail.com Date: 2015-03-28T07:12:06Z https://issues.apache.org/jira/browse/TIKA-1582 commit acaf27bb666fdef05bdb18d7edcaafe7ccfd9bf5 Author: LukeLiush hanson311...@gmail.com Date: 2015-03-28T07:16:07Z move the comments of apache licence to the top commit 701fcc394ed2110e4c771fbb84999dca77932392 Author: LukeLiush hanson311...@gmail.com Date: 2015-03-28T07:19:43Z add some comments commit 12f290826a88cd99bbf2e1a0385b315e73e3 Author: LukeLiush hanson311...@gmail.com Date: 2015-03-28T07:25:55Z move the example model file to the test resource directory commit 6c8d2e523c427380438f24d90985e28bfdbce050 Author: LukeLiush hanson311...@gmail.com Date: 2015-03-28T07:28:25Z remove empty comment block
[jira] [Commented] (TIKA-1582) Mime Detection based on neural networks with Byte-frequency-histogram
[ https://issues.apache.org/jira/browse/TIKA-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385177#comment-14385177 ]

ASF GitHub Bot commented on TIKA-1582:
--
GitHub user LukeLiush opened a pull request: https://github.com/apache/tika/pull/36 (Nn branch, https://issues.apache.org/jira/browse/TIKA-1582); the merge instructions and commit list are the same as in the pull-request message above.

Mime Detection based on neural networks with Byte-frequency-histogram
--
Key: TIKA-1582
URL: https://issues.apache.org/jira/browse/TIKA-1582
Project: Tika
Issue Type: Improvement
Components: detector, mime
Affects Versions: 1.7
Reporter: Luke sh
Priority: Trivial

Content-based detection is one of the popular approaches to MIME type detection; others are based on file extensions and magic numbers. Tika currently implements three approaches to detecting MIME types: 1) file extensions, 2) magic numbers (the most trustworthy in Tika), and 3) the Content-Type header of the HTTP response, if present and available. Content-based detection, by contrast, analyses the distribution of the entire stream of bytes, finds a similar pattern for files of the same type, and builds a function that groups them into one or several classes so as to classify and predict. It is believed this feature might broaden the usage of Tika and add a bit more security enforcement to MIME type detection: because we build a model that is etched with the patterns it has seen, in some situations we need not trust types that the model has not been trained on. Magic numbers embedded in a file can be copied, while the actual content could be a potentially detrimental Trojan program; by enforcing trust in byte-frequency patterns instead, we are able to enhance the security of the detection.

The proposed content-based MIME detection to be integrated into Tika is based on a machine-learning algorithm, namely a neural network trained with back-propagation. The input is 256 bins, one for each byte value 0-255, each storing the count of occurrences of that byte. The byte-frequency histograms are normalized to fall in the range between 0 and 1 and then passed through a companding function to enhance the infrequent bytes. The output of the neural network is a binary decision, 1 or 0. Note that the proposed feature will be implemented with the GRB file type as one example: we build a model that is able to distinguish the GRB file type from non-GRB file types. Since the set of non-GRB files is huge and cannot be easily defined, there need to be as many negative training examples as possible to form the non-GRB decision boundary. The neural network involves two stages of processing: training and classification.
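The input encoding described above (256-bin byte histogram, normalized into [0, 1], then companded to lift infrequent bytes) can be sketched as follows. This is an illustration, not the actual Tika code: the peak-normalization and the square-root companding curve are assumptions, since the issue does not fix the exact functions.

```python
from collections import Counter

def byte_histogram_features(data: bytes, exponent: float = 0.5):
    # One bin per byte value 0-255, holding that byte's occurrence count.
    counts = Counter(data)
    hist = [counts.get(b, 0) for b in range(256)]
    # Normalize the histogram into [0, 1]; here we divide by the peak count
    # (the issue only says the values are scaled into that range).
    peak = max(hist) or 1
    normalized = [c / peak for c in hist]
    # Companding: a concave curve such as x**0.5 boosts infrequent bytes
    # relative to frequent ones before the values reach the network.
    return [x ** exponent for x in normalized]
```

The resulting 256-element vector is what would be fed to the network's input layer, one value per bin.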
The training can be done in any programming language; in this feature/research, the training of the neural network is implemented in R, and the source can be found in my GitHub repository, https://github.com/LukeLiush/filetypeDetection. I am also going to post a document that describes the use of the program and the syntax/format of its input and output. After training, we need to export the model and import it into Tika: in Tika, we create a TrainedModelDetector that reads one or more model files, each with its model parameters, so that it can detect the MIME types covered by those models. Details of the research and usage of this proposed feature will be posted on my GitHub shortly.
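As a sketch of the classification stage a detector would run after importing the exported parameters: assuming the model reduces to a single logistic unit over the 256 companded bins (the function name, the all-zero weights in the usage example, and the 0.5 threshold are hypothetical illustrations, not the actual TrainedModelDetector implementation).

```python
import math

def classify(features, weights, bias, threshold=0.5):
    # Weighted sum of the 256 histogram features plus a bias term,
    # using weights loaded from the exported model file.
    z = bias + sum(w * x for w, x in zip(weights, features))
    # Logistic activation squashes the score into (0, 1).
    prob = 1.0 / (1.0 + math.exp(-z))
    # Binary decision: 1 means "matches the trained type (e.g. GRB)".
    return 1 if prob >= threshold else 0
```

With several trained types, the detector would hold one such parameter set per type and report the type whose unit fires (or none, falling back to Tika's other detectors).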