Re: [memex-jpl] this week action from luke
Great work Luke, and both of these changes make sense. Please send the pull request for that, thank you! Great work Giuseppe! Go team! Cheers, Chris Chris Mattmann chris.mattm...@gmail.com -Original Message- From: Luke hanson311...@gmail.com Date: Thursday, April 23, 2015 at 3:08 AM To: 'Luke' hanson311...@gmail.com, Chris Mattmann chris.a.mattm...@jpl.nasa.gov, Chris Mattmann chris.mattm...@gmail.com, 'Totaro, Giuseppe U (3980-Affiliate)' tot...@di.uniroma1.it, dev@tika.apache.org, 'Bryant, Ann C (398G-Affiliate)' anniebry...@gmail.com, 'Zimdars, Paul A (3980-Affiliate)' paul.a.zimd...@jpl.nasa.gov, NSF Polar CyberInfrastructure DR Students nsf-polar-usc-stude...@googlegroups.com, memex-...@googlegroups.com Subject: RE: [memex-jpl] this week action from luke Both of Giuseppe's patches work based on my tests: I was able to see the magic tag being prepended at the beginning of the file, and the .cbor extension being appended when running the Nutch dump tool command with the -extension cbor option. Thanks a lot for the kind help, Giuseppe; highly appreciated. A big thumbs-up for Giuseppe's work; it is thorough and considerate. To professor: with Giuseppe's two patches, we still need to make a small change in Tika's tika-mimetypes.xml (BTW, the cbor magic tag can be used as magic bytes in Tika as it does not look very common; even if it accidentally appears in some other type of file, Tika will have the extension and metadata-hint methods as a fallback strategy). I am going to send another pull request with that change, but before that, it is worth explaining what I am going to change, to avoid any confusion. We now have two problems. Problem 1: magic priority 40. application/xhtml+xml has a higher priority (50) than application/cbor (40); I don't know who assigned 40 to cbor, or why. So if xhtml gets read and compared first, cbor will not even be placed in the magic estimation list because of its lower priority.
Based on the tests, xhtml does indeed get read and compared first against the input file, so any type below priority 50 is disregarded. Problem 2: again, magic priority at 50. In Tika, given a file dumped by the Nutch dumper tool, both types (xhtml and cbor) are selected as candidate mime types and placed in the magic estimation list; since the xhtml type is read first, it is placed atop cbor. To break that tie, Tika relies on the decision from the extension method. If the extension method fails to detect the type (for simplicity, let's ignore the metadata-hint method, though the same applies to it), then xhtml is eventually returned. My pull request to be sent: I am going to set the magic priority of the cbor type to 50, the same as xhtml, because it would probably be risky to discard either of the estimated types without consulting the extension method. Any comments, suggestions, or thoughts are welcome and appreciated. Thanks Luke -Original Message- From: Luke [mailto:hanson311...@gmail.com] Sent: Wednesday, April 22, 2015 7:45 PM To: 'Mattmann, Chris A (3980)' Cc: 'Chris Mattmann'; 'Totaro, Giuseppe U (3980-Affiliate)'; 'dev@tika.apache.org'; 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A (3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students'; 'memex-...@googlegroups.com' Subject: RE: [memex-jpl] this week action from luke Hi Prof, The test finished, and the result is as expected. Both runs (Tika with the prob feature and without it) produced the same stats totals; please see the attached matched.txt, dumped by a small program that checks and compares each line in every section of the Stats total between the log produced by the Tika with the feature and the one without it. If string.equals(...) is satisfied, the line is printed; if there is a mismatch (e.g. the count for a particular mime type differs), an error is printed.
In the end, I don't see any errors in the printout, so I think the feature has passed the test. The processing times of the two tests are as follows. Start and end times for the test where the Nutch dumper tool ran with the prob selection feature: from 2015-04-22 15:47:08,330 to 2015-04-22 17:48:28,877. Start and end times for the test without the feature: from 2015-04-22 22:41:23,459 to 2015-04-23 00:11:02,767. BTW, I forgot to mention that the probabilistic mime selector with its default weight settings also gives the same result, because by default I intentionally assign a higher weight to the magic-bytes method so that it behaves similarly to the old strategy. On the other hand, if I know the extension is more reliable, I can certainly add more weight to the extension approach; in that case the prob mime selector will return application/cbor with a higher weight value.
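For reference, the tika-mimetypes.xml change discussed in this thread would look roughly like the sketch below. The priority value and the 0xd9d9f7 match follow the discussion (tag 55799 from RFC 7049 serializes as the bytes 0xd9 0xd9 0xf7); the exact attribute syntax should be checked against the entries already in the file, so treat this as an illustration rather than the merged patch:

```xml
<!-- Sketch of the proposed application/cbor entry; attribute values
     are taken from the thread above, not copied from trunk. -->
<mime-type type="application/cbor">
  <!-- raise the magic priority from 40 to 50 so cbor is not discarded
       before the extension method can break the tie with xhtml -->
  <magic priority="50">
    <!-- RFC 7049 self-describe tag 55799 = bytes 0xd9 0xd9 0xf7 -->
    <match value="0xd9d9f7" type="string" offset="0"/>
  </magic>
  <!-- extension fallback for files dumped with -extension cbor -->
  <glob pattern="*.cbor"/>
</mime-type>
```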
[jira] [Created] (TIKA-1616) Tika Parser for GIBS Metadata
Lewis John McGibbney created TIKA-1616: -- Summary: Tika Parser for GIBS Metadata Key: TIKA-1616 URL: https://issues.apache.org/jira/browse/TIKA-1616 Project: Tika Issue Type: New Feature Components: parser Reporter: Lewis John McGibbney Fix For: 1.9 [GIBS|https://earthdata.nasa.gov/about-eosdis/science-system-description/eosdis-components/global-imagery-browse-services-gibs] metadata currently consists of simple stuff in the WMTS GetCapabilities request (e.g. http://map1.vis.earthdata.nasa.gov/wmts-arctic/1.0.0/WMTSCapabilities.xml) which includes available layers, extents, time ranges, map projections, color maps, etc. We will eventually have more detailed visualization metadata available in ECHO/CMR which will include linkages to data products, provenance, etc. Some investigation and a Tika parser would be excellent to extract and assimilate GIBS Metadata. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
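As a starting point for the investigation, here is a rough sketch of pulling layer identifiers out of a WMTS GetCapabilities document with the JDK's DOM parser. The element names follow the WMTS 1.0 schema, but the class name and the sample document are invented, and a real Tika parser would map these values onto a Metadata object instead of returning a list:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class WmtsLayerIds {
    // Collect the text of <ows:Identifier> children of each <Layer> element.
    public static List<String> layerIds(String capabilitiesXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(capabilitiesXml.getBytes(StandardCharsets.UTF_8)));
            List<String> ids = new ArrayList<>();
            NodeList layers = doc.getElementsByTagName("Layer");
            for (int i = 0; i < layers.getLength(); i++) {
                NodeList children = layers.item(i).getChildNodes();
                for (int j = 0; j < children.getLength(); j++) {
                    // parser is not namespace-aware, so prefixed names are literal
                    if ("ows:Identifier".equals(children.item(j).getNodeName())) {
                        ids.add(children.item(j).getTextContent());
                    }
                }
            }
            return ids;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String sample = "<Capabilities xmlns:ows=\"http://www.opengis.net/ows/1.1\"><Contents>"
                + "<Layer><ows:Identifier>MODIS_Terra_CorrectedReflectance</ows:Identifier></Layer>"
                + "</Contents></Capabilities>";
        System.out.println(layerIds(sample)); // [MODIS_Terra_CorrectedReflectance]
    }
}
```

The richer ECHO/CMR metadata mentioned above would presumably need its own mapping once it becomes available.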
[GitHub] tika pull request: Cbor extension - set cbor magic priority to 50
GitHub user LukeLiush opened a pull request: https://github.com/apache/tika/pull/44 Cbor extension - set cbor magic priority to 50 You can merge this pull request into a Git repository by running: $ git pull https://github.com/LukeLiush/tika cborExtension Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/44.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #44 commit 5b86cccdfc6d637cb44c9f8b2642e438c2ae5ff4 Author: LukeLiush hanson311...@gmail.com Date: 2015-04-21T21:39:07Z add entry for cbor glob extension in the tika-mimetypes.xml commit f449969d876bbf9fc7fa0e979011e199cba2dd3e Author: LukeLiush hanson311...@gmail.com Date: 2015-04-23T22:24:19Z set the application/cbor magic priority to 50 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Updated] (TIKA-1617) Change OSGi Detection test to use OSGi Service
[ https://issues.apache.org/jira/browse/TIKA-1617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bob Paulin updated TIKA-1617: - Attachment: TIKA-1617.patch Patch included. Change OSGi Detection test to use OSGi Service -- Key: TIKA-1617 URL: https://issues.apache.org/jira/browse/TIKA-1617 Project: Tika Issue Type: Test Reporter: Bob Paulin Priority: Minor Attachments: TIKA-1617.patch Currently the testDetection test does not actually use the OSGi service created within the OSGi Framework. I've changed the test to use the service defined in the tika-bundle -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1610) CBOR Parser and detection [improvement]
[ https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510426#comment-14510426 ] Hudson commented on TIKA-1610: -- SUCCESS: Integrated in tika-trunk-jdk1.7 #644 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/644/]) TIKA-1610 Bump the CBOR mime magic priority to 60, to be more specific than (x)html, which is what CBOR often contains, and add a detection unit test (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1675755) * /tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml * /tika/trunk/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java * /tika/trunk/tika-parsers/src/test/resources/test-documents/NUTCH-1997.cbor CBOR Parser and detection [improvement] --- Key: TIKA-1610 URL: https://issues.apache.org/jira/browse/TIKA-1610 Project: Tika Issue Type: New Feature Components: detector, mime, parser Affects Versions: 1.7 Reporter: Luke sh Assignee: Chris A. Mattmann Priority: Trivial Labels: memex Attachments: 142440269.html, NUTCH-1997.cbor, cbor_tika.mimetypes.xml.jpg, rfc_cbor.jpg CBOR is a data format whose design goals include the possibility of extremely small code size, fairly small message size, and extensibility without the need for version negotiation (cited from http://cbor.io/). It would be great if Tika could provide CBOR parsing and identification. In the current project with Nutch, the Nutch CommonCrawlDataDumper is used to dump crawled segments to files in the CBOR format. In order to read/parse the files dumped by this tool, it would be great if Tika supported parsing cbor; the thing is that CommonCrawlDataDumper does not dump with the correct extension. It dumps with its own rule, and the default extension of a dumped file is html, so it would be less painful if Tika could detect and parse those files without any pre-processing steps. CommonCrawlDataDumper calls the following to dump with cbor: import com.fasterxml.jackson.dataformat.cbor.CBORFactory; import com.fasterxml.jackson.dataformat.cbor.CBORGenerator; fasterxml is a 3rd-party library for converting JSON to .cbor and vice versa. According to RFC 7049 (http://tools.ietf.org/html/rfc7049), it looks like CBOR does not yet have magic numbers to be detected/identified by other applications (PFA: rfc_cbor.jpg). It seems that the only ways to inform other applications of the type, as of now, are the extension (i.e. .cbor) or content detection (e.g. byte histogram distribution estimation). Another thing worth attention: it looks like Tika has attempted to add support for cbor mime detection in tika-mimetypes.xml (PFA: cbor_tika.mimetypes.xml.jpg), but this detection does not work with the cbor files dumped by CommonCrawlDataDumper. According to http://tools.ietf.org/html/rfc7049#section-2.4.5, there is a self-describing tag 55799 that seems intended for cbor type identification (its hex encoding is 0xd9d9f7), but it is probably up to the application to take care of this tag, and it is possible that the fasterxml library used by the Nutch dumping tool omits it; an example cbor file dumped by the Nutch tool (CommonCrawlDataDumper) has also been attached (PFA: 142440269.html). The following is cited from the RFC: "...a decoder might be able to parse both CBOR and JSON. Such a decoder would need to mechanically distinguish the two formats. An easy way for an encoder to help the decoder would be to tag the entire CBOR item with tag 55799, the serialization of which will never be found at the beginning of a JSON text..." It looks like a file can have two parts/sections, i.e. a plain-text part and the JSON prettified by cbor; this is also worth attention and consideration for parsing and type identification. On the other hand, it is worth noting that an entry for cbor extension detection needs to be added to tika-mimetypes.xml too, e.g. <glob pattern="*.cbor"/> -- This message was sent by Atlassian JIRA (v6.3.4#6332)
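The RFC 7049 self-describe check mentioned in the issue can be sketched in a few lines of Java. The class and method names here are illustrative, not part of Tika's detector API; the constant is the serialization of tag 55799 (0xd9 0xd9 0xf7), which is what a magic-based detector would look for at offset 0:

```java
// Minimal sketch: check an input's leading bytes for the RFC 7049
// self-describe tag 55799, which serializes as 0xd9 0xd9 0xf7.
public class CborTagCheck {
    private static final byte[] SELF_DESCRIBE = { (byte) 0xd9, (byte) 0xd9, (byte) 0xf7 };

    public static boolean hasSelfDescribeTag(byte[] leadingBytes) {
        if (leadingBytes.length < SELF_DESCRIBE.length) {
            return false;
        }
        for (int i = 0; i < SELF_DESCRIBE.length; i++) {
            if (leadingBytes[i] != SELF_DESCRIBE[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] tagged = { (byte) 0xd9, (byte) 0xd9, (byte) 0xf7, (byte) 0xa1 };
        System.out.println(hasSelfDescribeTag(tagged));                    // true
        System.out.println(hasSelfDescribeTag("<html xmlns=".getBytes())); // false
    }
}
```

This only helps, of course, if the encoder actually wrote the tag; as noted above, the files dumped via fasterxml appear not to carry it.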
RE: [memex-jpl] this week action from luke
Both of Giuseppe's patches work based on my tests: I was able to see the magic tag being prepended at the beginning of the file, and the .cbor extension being appended when running the Nutch dump tool command with the -extension cbor option. Thanks a lot for the kind help, Giuseppe; highly appreciated. A big thumbs-up for Giuseppe's work; it is thorough and considerate. To professor: with Giuseppe's two patches, we still need to make a small change in Tika's tika-mimetypes.xml (BTW, the cbor magic tag can be used as magic bytes in Tika as it does not look very common; even if it accidentally appears in some other type of file, Tika will have the extension and metadata-hint methods as a fallback strategy). I am going to send another pull request with that change, but before that, it is worth explaining what I am going to change, to avoid any confusion. We now have two problems. Problem 1: magic priority 40. application/xhtml+xml has a higher priority (50) than application/cbor (40); I don't know who assigned 40 to cbor, or why. So if xhtml gets read and compared first, cbor will not even be placed in the magic estimation list because of its lower priority. Based on the tests, xhtml does indeed get read and compared first against the input file, so any type below priority 50 is disregarded. Problem 2: again, magic priority at 50. In Tika, given a file dumped by the Nutch dumper tool, both types (xhtml and cbor) are selected as candidate mime types and placed in the magic estimation list; since the xhtml type is read first, it is placed atop cbor. To break that tie, Tika relies on the decision from the extension method. If the extension method fails to detect the type (for simplicity, let's ignore the metadata-hint method, though the same applies to it), then xhtml is eventually returned.
My pull request to be sent: I am going to set the magic priority of the cbor type to 50, the same as xhtml, because it would probably be risky to discard either of the estimated types without consulting the extension method. Any comments, suggestions, or thoughts are welcome and appreciated. Thanks Luke -Original Message- From: Luke [mailto:hanson311...@gmail.com] Sent: Wednesday, April 22, 2015 7:45 PM To: 'Mattmann, Chris A (3980)' Cc: 'Chris Mattmann'; 'Totaro, Giuseppe U (3980-Affiliate)'; 'dev@tika.apache.org'; 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A (3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students'; 'memex-...@googlegroups.com' Subject: RE: [memex-jpl] this week action from luke Hi Prof, The test finished, and the result is as expected. Both runs (Tika with the prob feature and without it) produced the same stats totals; please see the attached matched.txt, dumped by a small program that checks and compares each line in every section of the Stats total between the log produced by the Tika with the feature and the one without it. If string.equals(...) is satisfied, the line is printed; if there is a mismatch (e.g. the count for a particular mime type differs), an error is printed. In the end, I don't see any errors in the printout, so I think the feature has passed the test. The processing times of the two tests are as follows. Start and end times for the test where the Nutch dumper tool ran with the prob selection feature: from 2015-04-22 15:47:08,330 to 2015-04-22 17:48:28,877. Start and end times for the test without the feature:
from 2015-04-22 22:41:23,459 to 2015-04-23 00:11:02,767. BTW, I forgot to mention that the probabilistic mime selector with its default weight settings also gives the same result, because by default I intentionally assign a higher weight to the magic-bytes method so that it behaves similarly to the old strategy. On the other hand, if I know the extension is more reliable, I can certainly add more weight to the extension approach; in that case the prob mime selector will return application/cbor with a higher weight value. <match value="&lt;html xmlns=" type="string" offset="0:1024"/> Result: text/html <match value="&lt;html xmlns=" type="string" offset="0:6000"/> Result: application/xhtml+xml Please kindly let me know if you have any questions about the tests; Thanks Luke -Original Message- From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov] Sent: Wednesday, April 22, 2015 3:49 PM To: Luke Cc: Chris Mattmann; Totaro, Giuseppe U (3980-Affiliate); dev@tika.apache.org; Bryant, Ann C (398G-Affiliate); Zimdars, Paul A (3980-Affiliate); NSF Polar CyberInfrastructure DR Students; memex-...@googlegroups.com Subject: Re: [memex-jpl] this week action from luke Thanks Luke this is probably a good opportunity to test out your Bayesian mime detector
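The weighted selection Luke describes (magic favored by default, extension weight adjustable) can be sketched as a simple score sum over candidate types. The weights, type scores, and names below are invented for illustration; this is not the actual prob selector implementation:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of weighted ("probabilistic") mime selection: each detection
// method contributes its weight to the types it matched, and the type
// with the highest total wins.
public class WeightedMimeVote {
    public static String pick(Map<String, Double> scores) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            if (e.getValue() > bestScore) {
                bestScore = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double magicWeight = 0.5, extensionWeight = 0.3; // illustrative values
        Map<String, Double> scores = new LinkedHashMap<>();
        // magic matched both candidates; only the extension pointed at cbor
        scores.put("application/xhtml+xml", magicWeight);
        scores.put("application/cbor", magicWeight + extensionWeight);
        System.out.println(pick(scores)); // application/cbor
    }
}
```

With a larger magic weight and no extension match, the same sum reduces to the old magic-first behavior, which matches the observation that the default settings reproduce the old results.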
[jira] [Comment Edited] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14508932#comment-14508932 ] Tim Allison edited comment on TIKA-1513 at 4/23/15 12:25 PM: - Oh, broken files, y, that would explain your concern. And, y, that's pretty bad. Would you be able to run file against a handful of your false positives to see what file says those files are? This is the definition in my magic file, but it is commented out...not sure how file is actually working...
{noformat}
#0   byte    0x03
#!:mime application/x-dbf
#8   leshort 0
#12  leshort 0    FoxBase+, FoxPro, dBaseIII+, dBaseIV, no memo
{noformat}
was (Author: talli...@mitre.org): Oh, broken files, y, that would explain your concern. And, y, that's pretty bad. Would you be able to run file against a handful of your false positives to see what file says those files are? Add mime detection and parsing for dbf files Key: TIKA-1513 URL: https://issues.apache.org/jira/browse/TIKA-1513 Project: Tika Issue Type: Improvement Reporter: Tim Allison Priority: Minor Fix For: 1.9 I just came across an Apache licensed dbf parser that is available on [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom]. Let's add dbf parsing to Tika. Any other recommendations for alternate parsers? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
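As a sketch only, the commented-out magic rule quoted above transcribes to the following check. As Tim notes, it is unclear whether `file` really applies this rule, so treat the byte offsets and the zero comparisons as assumptions taken from the quoted snippet rather than a verified dbf signature:

```java
// Transcription of the quoted magic-file rule so its effect is explicit:
// byte 0x03 at offset 0, little-endian shorts equal to 0 at offsets 8 and 12.
public class DbfMagic {
    public static boolean matchesRule(byte[] header) {
        if (header.length < 14) {
            return false;
        }
        boolean versionByte = header[0] == 0x03;                            // "0 byte 0x03"
        int leshortAt8  = (header[8] & 0xff) | ((header[9] & 0xff) << 8);   // "8 leshort 0"
        int leshortAt12 = (header[12] & 0xff) | ((header[13] & 0xff) << 8); // "12 leshort 0"
        return versionByte && leshortAt8 == 0 && leshortAt12 == 0;
    }

    public static void main(String[] args) {
        byte[] candidate = new byte[14];
        candidate[0] = 0x03;
        System.out.println(matchesRule(candidate));    // true
        System.out.println(matchesRule(new byte[14])); // false: version byte is 0
    }
}
```

A single version byte of 0x03 is a very weak signature on its own, which may be part of why false positives were seen on broken files.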
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14508932#comment-14508932 ] Tim Allison commented on TIKA-1513: --- Oh, broken files, y, that would explain your concern. And, y, that's pretty bad. Would you be able to run file against a handful of your false positives to see what file says those files are? Add mime detection and parsing for dbf files Key: TIKA-1513 URL: https://issues.apache.org/jira/browse/TIKA-1513 Project: Tika Issue Type: Improvement Reporter: Tim Allison Priority: Minor Fix For: 1.9 I just came across an Apache licensed dbf parser that is available on [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom]. Let's add dbf parsing to Tika. Any other recommendations for alternate parsers? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TIKA-1615) Html fragments with comments before div elements are not being detected as html
colin created TIKA-1615: --- Summary: Html fragments with comments before div elements are not being detected as html Key: TIKA-1615 URL: https://issues.apache.org/jira/browse/TIKA-1615 Project: Tika Issue Type: Bug Components: detector Affects Versions: 1.7 Reporter: colin We are trying to import html fragments into Solr. The fragment below is not being detected as html: <!-- test --> <div> test </div> When the comment is removed, the fragment is parsed as html; this functionality was added by https://issues.apache.org/jira/browse/TIKA-1102 To work around this, we added <root-XML localName="div"/> <root-XML localName="DIV"/> to the <mime-type type="text/html"> element in tika-mimetypes.xml. The fragment is then parsed as expected. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1614) Geo Topic Parser
[ https://issues.apache.org/jira/browse/TIKA-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510454#comment-14510454 ] Anya Yun Li commented on TIKA-1614: --- Hi Nick, I understand your concern. This is a content-based geoparser: we identify location names in text, but in order to get the geographical information (longitude, latitude) we need some kind of database to look them up. Here I use Lucene to build an index on the GeoNames dataset, which provides that information. The binary patches are the Lucene index on the GeoNames dataset. If the above explanation does not answer your question, feel free to contact me. Best, Yun Geo Topic Parser Key: TIKA-1614 URL: https://issues.apache.org/jira/browse/TIKA-1614 Project: Tika Issue Type: New Feature Components: parser Reporter: Anya Yun Li Labels: memex ##Description This program aims to provide support for identifying geonames in any unstructured text data in the NSF polar research project. https://github.com/NSF-Polar-Cyberinfrastructure/datavis-hackathon/issues/1 This project is a content-based geotagging solution, made of a variety of NLP tools, and could be used for any geotagging purpose. ##Workflow 1. Plain text input is passed to the geoparser 2. Location names are extracted from the text using OpenNLP NER 3. Two roles are provided: * The most frequent location name is chosen as the best match for the input text * Other extracted locations are treated as (equal) alternatives 4. For each location extracted above, search for the best GeoName object and return the resolved objects with fields (name in gazetteer, longitude, latitude) ##How to Use *Caution*: This program requires at least 1.2 GB of disk space for building the Lucene index
```Java
// Illustrative usage: gazetteerPath and nerPath point at the GeoNames
// gazetteer and the OpenNLP NER model; geoparser is the parser instance.
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
GeoParserConfig config = new GeoParserConfig();
config.setGazetterPath(gazetteerPath);
config.setNERModelPath(nerPath);
context.set(GeoParserConfig.class, config);
geoparser.parse(stream, new BodyContentHandler(), metadata, context);
for (String name : metadata.names()) {
    String value = metadata.get(name);
    System.out.println(name + " " + value);
}
```
This parser writes useful geographical information to Tika's Metadata object. Fields for the best matched location:
```
Geographic_NAME
Geographic_LONGTITUDE
Geographic_LATITUDE
```
Fields for alternatives:
```
Geographic_NAME1 Geographic_LONGTITUDE1 Geographic_LATITUDE1
Geographic_NAME2 Geographic_LONGTITUDE2 Geographic_LATITUDE2
...
```
If you have any questions, contact me: anyayu...@gmail.com -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1614) Geo Topic Parser
[ https://issues.apache.org/jira/browse/TIKA-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510365#comment-14510365 ] Nick Burch commented on TIKA-1614: -- Do we really need to pull in all of Apache Lucene to make this work? Normally Lucene users depend on Tika, not the other way around! There's also a lot of chunky binary data in the patch - any chance you could explain what it is, why it's there, how it was generated, how someone could make fixes to it etc? Geo Topic Parser Key: TIKA-1614 URL: https://issues.apache.org/jira/browse/TIKA-1614 Project: Tika Issue Type: New Feature Components: parser Reporter: Anya Yun Li Labels: memex -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1598) Parser Implementation for Streaming Video
[ https://issues.apache.org/jira/browse/TIKA-1598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510371#comment-14510371 ] Nick Burch commented on TIKA-1598: -- [~rgauss] already maintains support for wrapping FFMpeg for use in Tika at https://github.com/AlfrescoLabs/tika-ffmpeg based on the ExternalParser support - is it possible to re-use / extend that for this additional use-case? Parser Implementation for Streaming Video - Key: TIKA-1598 URL: https://issues.apache.org/jira/browse/TIKA-1598 Project: Tika Issue Type: New Feature Components: parser Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Labels: memex Fix For: 1.9 A number of us have been discussing a Tika implementation which could, for example, bind to a live multimedia stream and parse content from the stream until it finished. An excellent example would be watching Bonnie Scotland beating R. of Ireland in the upcoming European Championship Qualifying - Group D on Sat 13 Jun @ 17:00 GMT :) I located a JMF Wrapper for ffmpeg which 'may' enable us to do this http://sourceforge.net/projects/jffmpeg/ I am not sure... plus it is not licensed liberally enough for us to include so if there are other implementations then please post them here. I 'may' be able to have a crack at implementing this next week. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1610) CBOR Parser and detection [improvement]
[ https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510402#comment-14510402 ] Luke sh commented on TIKA-1610: --- Thanks a lot [~gagravarr] for the prompt response. I thought it would probably be risky if we discarded either of the estimated types because of the magic priority (one being higher than the other); I wanted Tika to rely on the extension when there is a tie to break. For now, in this particular case, I also cannot think of any reason why we wouldn't use 60; maybe I am too skeptical. Thanks CBOR Parser and detection [improvement] --- Key: TIKA-1610 URL: https://issues.apache.org/jira/browse/TIKA-1610 Project: Tika Issue Type: New Feature Components: detector, mime, parser Affects Versions: 1.7 Reporter: Luke sh Assignee: Chris A. Mattmann Priority: Trivial Labels: memex Attachments: 142440269.html, NUTCH-1997.cbor, cbor_tika.mimetypes.xml.jpg, rfc_cbor.jpg -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1610) CBOR Parser and detection [improvement]
[ https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke sh updated TIKA-1610: -- Attachment: NUTCH-1997.cbor

CBOR Parser and detection [improvement]
---
Key: TIKA-1610 URL: https://issues.apache.org/jira/browse/TIKA-1610 Project: Tika Issue Type: New Feature Components: detector, mime, parser Affects Versions: 1.7 Reporter: Luke sh Assignee: Chris A. Mattmann Priority: Trivial Labels: memex Attachments: 142440269.html, NUTCH-1997.cbor, cbor_tika.mimetypes.xml.jpg, rfc_cbor.jpg

CBOR is a data format whose design goals include the possibility of extremely small code size, fairly small message size, and extensibility without the need for version negotiation (cited from http://cbor.io/). It would be great if Tika could provide CBOR parsing and identification.

In the current project with Nutch, the CommonCrawlDataDumper tool is used to dump crawled segments to files in CBOR format, so it would be great if Tika could read/parse those dumped files. The catch is that CommonCrawlDataDumper does not write the correct extension; it follows its own naming rule, and the default extension of a dumped file is .html. It would therefore be much less painful if Tika could detect and parse these files without any pre-processing step. CommonCrawlDataDumper uses the following to dump to CBOR:

import com.fasterxml.jackson.dataformat.cbor.CBORFactory;
import com.fasterxml.jackson.dataformat.cbor.CBORGenerator;

FasterXML is a third-party library for converting JSON to CBOR and vice versa.

According to RFC 7049 (http://tools.ietf.org/html/rfc7049), CBOR does not yet have dedicated magic numbers by which other applications can detect/identify it (PFA: rfc_cbor.jpg). For now, the only ways to inform other applications of the type seem to be the extension (i.e. .cbor) or content-based detection (e.g. byte-histogram distribution estimation).

One more thing worth attention: Tika has already attempted CBOR mime detection in tika-mimetypes.xml (PFA: cbor_tika.mimetypes.xml.jpg), but that detection does not work with the CBOR files dumped by CommonCrawlDataDumper. According to http://tools.ietf.org/html/rfc7049#section-2.4.5, there is a self-describing tag, 55799 (hex encoding 0xd9d9f7), that can be used for CBOR type identification; however, it is up to the writing application to emit this tag, and it is possible that the FasterXML writer used by the Nutch dump tool omits it. An example CBOR file dumped by CommonCrawlDataDumper is attached (PFA: 142440269.html). Citing the RFC: "...a decoder might be able to parse both CBOR and JSON. Such a decoder would need to mechanically distinguish the two formats. An easy way for an encoder to help the decoder would be to tag the entire CBOR item with tag 55799, the serialization of which will never be found at the beginning of a JSON text..."

It also looks like a dumped file can have two parts/sections, i.e. a plain-text part and the JSON serialized as CBOR; this too deserves attention when parsing and identifying the type. On the other hand, it is worth noting that an entry for .cbor extension detection needs to be added to tika-mimetypes.xml as well, e.g. glob pattern="*.cbor".

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
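For reference, the self-describing tag 55799 serializes as a CBOR major-type-6 head with a 16-bit argument, which is exactly the three bytes 0xd9 0xd9 0xf7 mentioned above. A small stdlib-only Python sketch (illustrative code, not part of Tika or FasterXML):

```python
import struct

# CBOR tag 55799 (RFC 7049, section 2.4.5): major type 6 (tag) with a
# uint16 argument. Initial byte 0xd9 = (6 << 5) | 25, where additional
# info 25 means "argument in the next 2 bytes"; 55799 == 0xd9f7.
SELF_DESCRIBE = bytes([(6 << 5) | 25]) + struct.pack(">H", 55799)

def looks_like_self_described_cbor(data: bytes) -> bool:
    """True if the stream starts with the self-describing CBOR tag."""
    return data.startswith(SELF_DESCRIBE)
```

Since this prefix can never begin a JSON text (per the RFC), prepending it to each dumped item is enough for a sniffer to tell the two formats apart.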
[jira] [Comment Edited] (TIKA-1610) CBOR Parser and detection [improvement]
[ https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14510382#comment-14510382 ] Luke sh edited comment on TIKA-1610 at 4/24/15 2:43 AM:

Notes: the attached CBOR file (i.e. NUTCH-1997.cbor) contains magic bytes for both type xhtml and type cbor. With priority 40 on application/cbor, we have the following issues.

Problem 1: magic priority 40. application/xhtml+xml has a higher priority (50) than application/cbor (40); [I don't know who assigned 40 to cbor, or why]. Since xhtml is read and compared first, cbor is never even placed in the magic candidate list because of its lower priority. Tests confirm that xhtml is indeed read and compared first against the input file, so any type below priority 50 is disregarded.

Problem 2: magic priority 50, again. If cbor also had priority 50, then given a file dumped by the Nutch dump tool, both types (xhtml and cbor) would be selected as candidate mime types and placed in the magic candidate list; since xhtml is read first, it is placed ahead of cbor. To break that tie, Tika relies on the decision of the extension method. If the extension method fails to detect the type (for now, let's ignore the metadata-hint method for simplicity, though the same applies to it), xhtml is eventually returned.

My pull request to be sent: I am going to set the magic priority of the cbor type to 50, the same as xhtml, because it would probably be risky to discard either of the candidate types without consulting the extension method.
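The interaction of magic priority and the extension fallback described in Problems 1 and 2 can be modelled in a few lines of Python (a simplified illustration of the behaviour discussed in the comment, not Tika's actual detector; all names here are invented for the sketch):

```python
# Simplified model: magic matches are filtered by priority, and a tie
# between equal-priority candidates falls back to the file extension.

CBOR_MAGIC = b"\xd9\xd9\xf7"   # self-describing tag 55799
XHTML_MAGIC = b"<?xml"
GLOBS = {".cbor": "application/cbor", ".html": "application/xhtml+xml"}

def detect(data: bytes, name: str, cbor_priority: int = 40) -> str:
    magics = [
        ("application/xhtml+xml", 50, XHTML_MAGIC),
        ("application/cbor", cbor_priority, CBOR_MAGIC),
    ]
    # Keep only the magic matches at the highest priority seen so far;
    # anything lower is discarded outright (Problem 1).
    best, candidates = -1, []
    for mime, prio, magic in magics:
        if magic in data[:1024] and prio >= best:
            if prio > best:
                best, candidates = prio, []
            candidates.append(mime)
    if len(candidates) == 1:
        return candidates[0]
    # Equal-priority tie (Problem 2): consult the extension.
    for ext, mime in GLOBS.items():
        if name.endswith(ext):
            return mime
    # Tie unresolved: the first-read type (xhtml) wins.
    return candidates[0] if candidates else "application/octet-stream"
```

With the default priority of 40, a dumped file containing both magics is always reported as xhtml; raising cbor to 50 lets the .cbor extension break the tie, which is why the proposed pull request bumps the priority rather than dropping either candidate.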
[jira] [Commented] (TIKA-1610) CBOR Parser and detection [improvement]
[ https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14510394#comment-14510394 ] Nick Burch commented on TIKA-1610: --

Based on that, I think the CBOR mime magic needs to be higher than the (x)html one, not lower and not the same. So, in r1675755, I've set it to 60 and added detection unit tests. These tests failed before the bump from 40 to 60, so I think we're in a better place now!

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
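Nick's fix suggests an entry in tika-mimetypes.xml along these lines (a sketch reconstructed from the discussion above; the exact attributes committed in r1675755 may differ):

```xml
<mime-type type="application/cbor">
  <!-- Self-describing CBOR tag 55799 (RFC 7049, section 2.4.5) -->
  <magic priority="60">
    <match value="0xd9d9f7" type="string" offset="0"/>
  </magic>
  <glob pattern="*.cbor"/>
</mime-type>
```

Priority 60 puts the three-byte CBOR magic ahead of the (x)html magic at 50, while the glob still catches correctly named .cbor files that lack the self-describing tag.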
[jira] [Created] (TIKA-1617) Change OSGi Detection test to use OSGi Service
Bob Paulin created TIKA-1617: Summary: Change OSGi Detection test to use OSGi Service Key: TIKA-1617 URL: https://issues.apache.org/jira/browse/TIKA-1617 Project: Tika Issue Type: Test Reporter: Bob Paulin Priority: Minor

Currently the testDetection test does not actually use the OSGi service created within the OSGi framework. I've changed the test to use the service defined in the tika-bundle.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)