Re: [memex-jpl] this week action from luke

2015-04-23 Thread Chris Mattmann
Great work Luke, and both of these changes make sense.
Please send the pull request for that; thank you!

Great work Giuseppe! Go team!

Cheers,
Chris


Chris Mattmann
chris.mattm...@gmail.com




-Original Message-
From: Luke hanson311...@gmail.com
Date: Thursday, April 23, 2015 at 3:08 AM
To: 'Luke' hanson311...@gmail.com, Chris Mattmann
chris.a.mattm...@jpl.nasa.gov, Chris Mattmann
chris.mattm...@gmail.com, 'Totaro, Giuseppe U (3980-Affiliate)'
tot...@di.uniroma1.it, dev@tika.apache.org, 'Bryant, Ann C
(398G-Affiliate)' anniebry...@gmail.com, 'Zimdars, Paul A
(3980-Affiliate)' paul.a.zimd...@jpl.nasa.gov, NSF Polar
CyberInfrastructure DR Students nsf-polar-usc-stude...@googlegroups.com,
memex-...@googlegroups.com
Subject: RE: [memex-jpl] this week action from luke

Both patches from Giuseppe work based on my tests; from the tests I was able
to see that the magic tag was being appended at the beginning of the file, and
that the cbor extension was being appended too when running the Nutch dump
tool command with the -extension cbor option. Thanks a lot for the kind help,
Giuseppe, highly appreciated. I want to give a big thumbs up to Giuseppe's
work; it is thorough and considerate too.
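
For anyone who wants to repeat that check by hand, here is a minimal sketch
(an illustration only; the file name is made up) that looks for the RFC 7049
self-describe tag (0xd9 0xd9 0xf7) at the start of a dump:

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

public class CborTagCheck {
    public static void main(String[] args) throws IOException {
        // hypothetical file produced by the Nutch dump tool with -extension cbor
        try (InputStream in = Files.newInputStream(Paths.get("142440269.cbor"))) {
            byte[] head = new byte[3];
            int n = in.read(head);
            // RFC 7049 section 2.4.5: the self-describe tag 55799 serializes as 0xd9 0xd9 0xf7
            boolean tagged = n == 3
                    && (head[0] & 0xff) == 0xd9
                    && (head[1] & 0xff) == 0xd9
                    && (head[2] & 0xff) == 0xf7;
            System.out.println(tagged ? "self-describing CBOR tag present" : "tag missing");
        }
    }
}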

To the professor:
with Giuseppe's two patches, we still need to make a small change in Tika's
tika-mimetypes.xml (BTW, the cbor magic tag can be used as magic bytes in Tika
since it does not look very common; even if it accidentally appears in some
other type of file, Tika still has the extension and metadata-hint methods as
a fallback strategy). I am going to send another pull request with that
change, but before that it would be good to explain what I am going to change,
to avoid any confusion.

Now we have two problems.
Problem 1: magic priority 40.
   application/xhtml+xml has a higher priority (50) than application/cbor
(40) [I don't know who assigned 40 to cbor, or why]. So if xhtml gets read and
compared first, cbor will not even be placed in the magic estimation list
because of its lower priority. Based on the tests, xhtml is indeed read and
compared against the input file first, so any type below priority 50 is
disregarded.


Problem 2: magic priority again, this time with both at 50.
   In Tika, given a file dumped by the Nutch dumper tool, both types (xhtml
and cbor) will be selected as candidate mime types and put in the magic
estimation list; since the xhtml type gets read first, it is placed above
cbor. To break that tie, Tika relies on the decision from the extension
method. If the extension method fails to detect the type (for now, let's
ignore the metadata hint method for simplicity, but the same applies to it),
then xhtml is returned eventually.

My pull request to be sent: I am going to set the magic priority of the cbor
type to 50, the same as xhtml, because it would probably be risky to discard
any of the estimated types without consulting the extension method.
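
To make the tie-breaking concrete, here is a minimal sketch (an illustration
only, not part of the pull request; the file path is made up) of how passing
the resource name to Tika lets the extension method weigh in once the magic
candidates tie:

import java.io.BufferedInputStream;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import org.apache.tika.Tika;

public class DetectDump {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        // hypothetical output of the Nutch dump tool run with "-extension cbor"
        Path dump = Paths.get("dump/142440269.cbor");
        try (InputStream in = new BufferedInputStream(Files.newInputStream(dump))) {
            // the file name lets the *.cbor glob act as the tie-breaker
            // between the competing magic candidates (xhtml vs cbor)
            String type = tika.detect(in, dump.getFileName().toString());
            System.out.println(type);
        }
    }
}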

Any comments, suggestions, or thoughts will be welcome and appreciated.

Thanks
Luke

-Original Message-
From: Luke [mailto:hanson311...@gmail.com]
Sent: Wednesday, April 22, 2015 7:45 PM
To: 'Mattmann, Chris A (3980)'
Cc: 'Chris Mattmann'; 'Totaro, Giuseppe U (3980-Affiliate)';
'dev@tika.apache.org'; 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A
(3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students';
'memex-...@googlegroups.com'
Subject: RE: [memex-jpl] this week action from luke

Hi Prof,

The test has finished, and the result is as expected.
Both runs (Tika with the prob feature and the one without it) produced the
same stats total; please see the attached matched.txt, dumped by the small
program that checks and compares, line by line, every section of the stats
total between the log produced by the Tika that has the feature and the one
without it. If string.equals(...) is satisfied, the line is dumped out; if
there is a mismatch (e.g. the count for a particular mime type differs), an
error is dumped out. I don't see any error in the printout, so I think the
feature has passed the test.


The processing times of the two tests are as follows.
The following shows the start and end time for the test where the Nutch
dumper tool ran with the prob selection feature:
from
2015-04-22 15:47:08,330
to
2015-04-22 17:48:28,877

The following shows the start and end time for the test where the Nutch
dumper tool ran with the Tika that does not have the feature:
from
2015-04-22 22:41:23,459
to
2015-04-23 00:11:02,767


BTW, I forgot to mention that the probabilistic mime selector with its default
weight settings also gives the same result, because by default I intentionally
assign a higher weight to the magic bytes method so as to make it work in a
way similar to the old strategy. On the other hand, if I know that the
extension is more reliable, I can certainly add more weight to the extension
approach; in that case, the prob mime selector will return application/cbor
with a higher weight value.

[jira] [Created] (TIKA-1616) Tika Parser for GIBS Metadata

2015-04-23 Thread Lewis John McGibbney (JIRA)
Lewis John McGibbney created TIKA-1616:
--

 Summary: Tika Parser for GIBS Metadata
 Key: TIKA-1616
 URL: https://issues.apache.org/jira/browse/TIKA-1616
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Lewis John McGibbney
 Fix For: 1.9


[GIBS|https://earthdata.nasa.gov/about-eosdis/science-system-description/eosdis-components/global-imagery-browse-services-gibs]
 metadata currently consists of simple stuff in the WMTS GetCapabilities 
request (e.g. 
http://map1.vis.earthdata.nasa.gov/wmts-arctic/1.0.0/WMTSCapabilities.xml) 
which includes available layers, extents, time ranges, map projections, color 
maps, etc. We will eventually have more detailed visualization metadata 
available in ECHO/CMR which will include linkages to data products, provenance, 
etc. 
Some investigation and a Tika parser would be excellent to extract and 
assimilate GIBS Metadata.
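
For a rough feel of what such a parser would extract, here is a throwaway sketch 
(plain DOM against the public GetCapabilities URL above; the WMTS/OWS namespace 
URIs are assumptions from the WMTS 1.0 spec, and this is not the proposed Tika 
parser) that lists the layer identifiers:

import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class GibsLayerList {
    public static void main(String[] args) throws Exception {
        String url = "http://map1.vis.earthdata.nasa.gov/wmts-arctic/1.0.0/WMTSCapabilities.xml";
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder().parse(url);
        // assumed namespace URIs for WMTS 1.0 and OWS 1.1
        String wmtsNs = "http://www.opengis.net/wmts/1.0";
        String owsNs = "http://www.opengis.net/ows/1.1";
        NodeList layers = doc.getElementsByTagNameNS(wmtsNs, "Layer");
        for (int i = 0; i < layers.getLength(); i++) {
            Element layer = (Element) layers.item(i);
            NodeList ids = layer.getElementsByTagNameNS(owsNs, "Identifier");
            if (ids.getLength() > 0) {
                // each layer advertises an ows:Identifier, e.g. a GIBS product name
                System.out.println(ids.item(0).getTextContent());
            }
        }
    }
}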



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] tika pull request: Cbor extension - set cbor magic priority to 50

2015-04-23 Thread LukeLiush
GitHub user LukeLiush opened a pull request:

https://github.com/apache/tika/pull/44

Cbor extension  - set cbor magic priority to 50



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/LukeLiush/tika cborExtension

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/44.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #44


commit 5b86cccdfc6d637cb44c9f8b2642e438c2ae5ff4
Author: LukeLiush hanson311...@gmail.com
Date:   2015-04-21T21:39:07Z

add entry for cbor glob extension in the tika-mimetypes.xml

commit f449969d876bbf9fc7fa0e979011e199cba2dd3e
Author: LukeLiush hanson311...@gmail.com
Date:   2015-04-23T22:24:19Z

set the application/cbor magic priority to 50




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Updated] (TIKA-1617) Change OSGi Detection test to use OSGi Service

2015-04-23 Thread Bob Paulin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bob Paulin updated TIKA-1617:
-
Attachment: TIKA-1617.patch

Patch included.

 Change OSGi Detection test to use OSGi Service
 --

 Key: TIKA-1617
 URL: https://issues.apache.org/jira/browse/TIKA-1617
 Project: Tika
  Issue Type: Test
Reporter: Bob Paulin
Priority: Minor
 Attachments: TIKA-1617.patch


 Currently the testDetection test does not actually use the OSGi service 
 created within the OSGi Framework.  I've changed the test to use the service 
 defined in the tika-bundle



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1610) CBOR Parser and detection [improvement]

2015-04-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510426#comment-14510426
 ] 

Hudson commented on TIKA-1610:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #644 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/644/])
TIKA-1610 Bump the CBOR mime magic priority to 60, to be more specific than 
(x)html, which is what CBOR often contains, and add a detection unit test 
(nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1675755)
* 
/tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java
* /tika/trunk/tika-parsers/src/test/resources/test-documents/NUTCH-1997.cbor


 CBOR Parser and detection [improvement]
 ---

 Key: TIKA-1610
 URL: https://issues.apache.org/jira/browse/TIKA-1610
 Project: Tika
  Issue Type: New Feature
  Components: detector, mime, parser
Affects Versions: 1.7
Reporter: Luke sh
Assignee: Chris A. Mattmann
Priority: Trivial
  Labels: memex
 Attachments: 142440269.html, NUTCH-1997.cbor, 
 cbor_tika.mimetypes.xml.jpg, rfc_cbor.jpg


 CBOR is a data format whose design goals include the possibility of extremely 
 small code size, fairly small message size, and extensibility without the 
 need for version negotiation (cited from http://cbor.io/ ).
 It would be great if Tika were able to provide support for CBOR parsing and 
 identification. In the current project with Nutch, the Nutch 
 CommonCrawlDataDumper is used to dump the crawled segments to files in CBOR 
 format. In order to read/parse the files dumped by this tool, it would be 
 great if Tika could parse cbor. The thing is that the CommonCrawlDataDumper 
 does not dump with the correct extension; it dumps by its own rule, and the 
 default extension of a dumped file is html, so it would be less painful if 
 Tika were able to detect and parse those files without any pre-processing 
 steps. 
 CommonCrawlDataDumper calls the following to dump with cbor:
 import com.fasterxml.jackson.dataformat.cbor.CBORFactory;
 import com.fasterxml.jackson.dataformat.cbor.CBORGenerator;
 fasterxml is a 3rd-party library for converting json to .cbor and vice versa.
 According to RFC 7049 (http://tools.ietf.org/html/rfc7049), it looks like 
 CBOR does not yet have magic numbers that other applications can use to 
 detect/identify it (PFA: rfc_cbor.jpg).
 It seems that the only way to inform other applications of the type as of now 
 is the extension (i.e. .cbor), or perhaps content detection (e.g. byte 
 histogram distribution estimation).
 Another thing worth attention: it looks like Tika has already attempted to 
 add support for cbor mime detection in tika-mimetypes.xml 
 (PFA: cbor_tika.mimetypes.xml.jpg); this detection does not work with the 
 cbor file dumped by CommonCrawlDataDumper.
 According to http://tools.ietf.org/html/rfc7049#section-2.4.5, there is a 
 self-describing tag 55799 that seems to be usable for cbor type 
 identification (its hex encoding is 0xd9d9f7), but it is probably up to the 
 application to take care of this tag, and it is also possible that the 
 fasterxml library used by the Nutch dumping tool does not write this tag. An 
 example cbor file dumped by the Nutch tool, i.e. CommonCrawlDataDumper, has 
 also been attached (PFA: 142440269.html).
 The following is cited from the RFC: "...a decoder might be able to parse 
 both CBOR and JSON. Such a decoder would need to mechanically distinguish the 
 two formats. An easy way for an encoder to help the decoder would be to tag 
 the entire CBOR item with tag 55799, the serialization of which will never be 
 found at the beginning of a JSON text..."
 It looks like a file can have two parts/sections, i.e. the plain text parts 
 and the json prettified by cbor; this might also be worth attention and 
 consideration for the parsing and type identification.
 On the other hand, it is worth noting that the entry for cbor extension 
 detection needs to be added to tika-mimetypes.xml too, e.g.
 <glob pattern="*.cbor"/>
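
 If the jackson CBOR backend used by the dumper supports its write-type-header 
 feature (an assumption about the version Nutch ships, not something verified 
 here), emitting the self-describe tag mentioned above (55799) from the dumper 
 side could be as simple as this sketch:

 import java.util.Collections;
 import com.fasterxml.jackson.databind.ObjectMapper;
 import com.fasterxml.jackson.dataformat.cbor.CBORFactory;
 import com.fasterxml.jackson.dataformat.cbor.CBORGenerator;

 public class SelfDescribingCborSketch {
     public static void main(String[] args) throws Exception {
         CBORFactory factory = new CBORFactory();
         // ask the generator to prepend the self-describe tag 55799 (0xd9 0xd9 0xf7)
         factory.enable(CBORGenerator.Feature.WRITE_TYPE_HEADER);
         ObjectMapper mapper = new ObjectMapper(factory);
         byte[] out = mapper.writeValueAsBytes(Collections.singletonMap("url", "http://example.com/"));
         System.out.printf("first bytes: %02x %02x %02x%n", out[0] & 0xff, out[1] & 0xff, out[2] & 0xff);
     }
 }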



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


RE: [memex-jpl] this week action from luke

2015-04-23 Thread Luke
Both patches from Giuseppe work based on my tests; from the tests I was able
to see that the magic tag was being appended at the beginning of the file, and
that the cbor extension was being appended too when running the Nutch dump
tool command with the -extension cbor option. Thanks a lot for the kind help,
Giuseppe, highly appreciated. I want to give a big thumbs up to Giuseppe's
work; it is thorough and considerate too.

To the professor:
with Giuseppe's two patches, we still need to make a small change in Tika's
tika-mimetypes.xml (BTW, the cbor magic tag can be used as magic bytes in Tika
since it does not look very common; even if it accidentally appears in some
other type of file, Tika still has the extension and metadata-hint methods as
a fallback strategy). I am going to send another pull request with that
change, but before that it would be good to explain what I am going to change,
to avoid any confusion.

Now we have two problems.
Problem 1: magic priority 40.
application/xhtml+xml has a higher priority (50) than application/cbor (40)
[I don't know who assigned 40 to cbor, or why]. So if xhtml gets read and
compared first, cbor will not even be placed in the magic estimation list
because of its lower priority. Based on the tests, xhtml is indeed read and
compared against the input file first, so any type below priority 50 is
disregarded.


Problem 2: magic priority again, this time with both at 50.
In Tika, given a file dumped by the Nutch dumper tool, both types (xhtml and
cbor) will be selected as candidate mime types and put in the magic estimation
list; since the xhtml type gets read first, it is placed above cbor. To break
that tie, Tika relies on the decision from the extension method. If the
extension method fails to detect the type (for now, let's ignore the metadata
hint method for simplicity, but the same applies to it), then xhtml is
returned eventually.

My pull request to be sent: I am going to set the magic priority of the cbor
type to 50, the same as xhtml, because it would probably be risky to discard
any of the estimated types without consulting the extension method.

Any comments, suggestions, or thoughts will be welcome and appreciated.

Thanks
Luke

-Original Message-
From: Luke [mailto:hanson311...@gmail.com] 
Sent: Wednesday, April 22, 2015 7:45 PM
To: 'Mattmann, Chris A (3980)'
Cc: 'Chris Mattmann'; 'Totaro, Giuseppe U (3980-Affiliate)';
'dev@tika.apache.org'; 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A
(3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students';
'memex-...@googlegroups.com'
Subject: RE: [memex-jpl] this week action from luke

Hi Prof,

The test has finished, and the result is as expected.
Both runs (Tika with the prob feature and the one without it) produced the
same stats total; please see the attached matched.txt, dumped by the small
program that checks and compares, line by line, every section of the stats
total between the log produced by the Tika that has the feature and the one
without it. If string.equals(...) is satisfied, the line is dumped out; if
there is a mismatch (e.g. the count for a particular mime type differs), an
error is dumped out. I don't see any error in the printout, so I think the
feature has passed the test.


The processing times of the two tests are as follows.
The following shows the start and end time for the test where the Nutch
dumper tool ran with the prob selection feature:
from
2015-04-22 15:47:08,330
to
2015-04-22 17:48:28,877

The following shows the start and end time for the test where the Nutch
dumper tool ran with the Tika that does not have the feature:
from
2015-04-22 22:41:23,459
to
2015-04-23 00:11:02,767


BTW, I forgot to mention that the probabilistic mime selector with its default
weight settings also gives the following result, because by default I
intentionally assign a higher weight to the magic bytes method so as to make
it work in a way similar to the old strategy. On the other hand, if I know
that the extension is more reliable, I can certainly add more weight to the
extension approach; in that case, the prob mime selector will return
application/cbor with a higher weight value.

 <match value="&lt;html xmlns=" type="string" offset="0:1024"/>
 Result: text/html
 
 <match value="&lt;html xmlns=" type="string" offset="0:6000"/>
 Result: application/xhtml+xml
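
Just to illustrate the weighting idea, here is a toy sketch of the scoring
(not the actual selector code; the weights and candidate lists are made up)
showing how weighted votes let the extension break the xhtml/cbor magic tie:

import java.util.LinkedHashMap;
import java.util.Map;

public class WeightedMimeVote {
    public static void main(String[] args) {
        // made-up weights for the two detection methods
        double magicWeight = 0.75, extensionWeight = 0.25;

        Map<String, Double> scores = new LinkedHashMap<String, Double>();
        // the magic method proposed both candidates, so each gets the magic weight
        vote(scores, "application/xhtml+xml", magicWeight);
        vote(scores, "application/cbor", magicWeight);
        // the extension method only proposes cbor for a *.cbor file name
        vote(scores, "application/cbor", extensionWeight);

        String best = null;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            if (best == null || e.getValue() > scores.get(best)) {
                best = e.getKey();
            }
        }
        // cbor wins (1.0 vs 0.75) because it collects votes from both methods
        System.out.println(best + " " + scores.get(best));
    }

    static void vote(Map<String, Double> scores, String type, double weight) {
        Double old = scores.get(type);
        scores.put(type, (old == null ? 0.0 : old) + weight);
    }
}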


Please let me know if you have any questions about the tests.


Thanks
Luke

-Original Message-
From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov]
Sent: Wednesday, April 22, 2015 3:49 PM
To: Luke
Cc: Chris Mattmann; Totaro, Giuseppe U (3980-Affiliate);
dev@tika.apache.org; Bryant, Ann C (398G-Affiliate); Zimdars, Paul A
(3980-Affiliate); NSF Polar CyberInfrastructure DR Students;
memex-...@googlegroups.com
Subject: Re: [memex-jpl] this week action from luke

Thanks Luke this is probably a good opportunity to test out your Bayesian
mime detector 

[jira] [Comment Edited] (TIKA-1513) Add mime detection and parsing for dbf files

2015-04-23 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14508932#comment-14508932
 ] 

Tim Allison edited comment on TIKA-1513 at 4/23/15 12:25 PM:
-

Oh, broken files, yes, that would explain your concern. And, yes, that's 
pretty bad. 

Would you be able to run "file" against a handful of your false positives to 
see what "file" says those files are?

This is the definition in my magic file, but it is commented out... not sure 
how "file" is actually working...

{noformat}
#0  byte   0x03
#!:mime application/x-dbf
#8 leshort   0
#12   leshort   0   FoxBase+, FoxPro, dBaseIII+, dBaseIV, no memo
{noformat}
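
Roughly, that commented-out definition says: the byte at offset 0 is 0x03 and 
the little-endian shorts at offsets 8 and 12 are both 0. A quick sketch of the 
same test in Java (my own illustration with a made-up file name, mostly to 
show how weak this magic is):

import java.io.IOException;
import java.io.RandomAccessFile;

public class DbfMagicCheck {
    public static void main(String[] args) throws IOException {
        try (RandomAccessFile f = new RandomAccessFile("suspect.dbf", "r")) {
            byte[] header = new byte[14];
            f.readFully(header);
            boolean match = (header[0] & 0xff) == 0x03   // version byte 0x03
                    && leShort(header, 8) == 0           // leshort at offset 8
                    && leShort(header, 12) == 0;         // leshort at offset 12
            System.out.println(match ? "matches the commented-out x-dbf magic" : "no match");
        }
    }

    static int leShort(byte[] b, int off) {
        return (b[off] & 0xff) | ((b[off + 1] & 0xff) << 8);
    }
}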


was (Author: talli...@mitre.org):
Oh, broken files, yes, that would explain your concern. And, yes, that's 
pretty bad. 

Would you be able to run "file" against a handful of your false positives to 
see what "file" says those files are?

 Add mime detection and parsing for dbf files
 

 Key: TIKA-1513
 URL: https://issues.apache.org/jira/browse/TIKA-1513
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Priority: Minor
 Fix For: 1.9


 I just came across an Apache licensed dbf parser that is available on 
 [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom].
 Let's add dbf parsing to Tika.
 Any other recommendations for alternate parsers?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2015-04-23 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14508932#comment-14508932
 ] 

Tim Allison commented on TIKA-1513:
---

Oh, broken files, yes, that would explain your concern. And, yes, that's 
pretty bad. 

Would you be able to run "file" against a handful of your false positives to 
see what "file" says those files are?

 Add mime detection and parsing for dbf files
 

 Key: TIKA-1513
 URL: https://issues.apache.org/jira/browse/TIKA-1513
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Priority: Minor
 Fix For: 1.9


 I just came across an Apache licensed dbf parser that is available on 
 [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom].
 Let's add dbf parsing to Tika.
 Any other recommendations for alternate parsers?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TIKA-1615) Html fragments with comments before div elements are not been detected as html

2015-04-23 Thread colin (JIRA)
colin created TIKA-1615:
---

 Summary: Html fragments with comments before div elements are not 
been detected as html
 Key: TIKA-1615
 URL: https://issues.apache.org/jira/browse/TIKA-1615
 Project: Tika
  Issue Type: Bug
  Components: detector
Affects Versions: 1.7
Reporter: colin


We are trying to import html fragments into Solr.

The fragment below is not being detected as html:

<!-- test -->
<div>
 test
</div>

When the comment is removed, the fragment is parsed as html; this
functionality was added by https://issues.apache.org/jira/browse/TIKA-1102

To work around this, we added

<root-XML localName="div"/>
<root-XML localName="DIV"/>

to the <mime-type type="text/html"> element in tika-mimetypes.xml

The fragment is then parsed as expected.
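
A quick way to reproduce the check (a minimal sketch, not the reporter's code)
is to feed the fragment straight to Tika's detector:

import java.io.ByteArrayInputStream;
import org.apache.tika.Tika;

public class FragmentDetection {
    public static void main(String[] args) throws Exception {
        String fragment = "<!-- test -->\n<div>\n test\n</div>";
        Tika tika = new Tika();
        // the report is that, with the leading comment, this is not detected as text/html
        String type = tika.detect(new ByteArrayInputStream(fragment.getBytes("UTF-8")));
        System.out.println(type);
    }
}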









--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1614) Geo Topic Parser

2015-04-23 Thread Anya Yun Li (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510454#comment-14510454
 ] 

Anya Yun Li commented on TIKA-1614:
---

Hi Nick,
I understand your concern. This is a content-based geoparser method: we 
identify location names in the text, but in order to get the geographical 
information (longitude, latitude) we need some kind of database to look them 
up in. Here I use Lucene to build an index over the GeoNames dataset, and 
that dataset provides this information.
The binary files in the patch are the Lucene index over the GeoNames dataset.
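
For illustration, looking up a resolved name in such an index takes only a few 
lines of Lucene; a rough sketch (the index path and field names are 
assumptions, not necessarily what the patch uses, and a Lucene 5.x API is 
assumed):

import java.nio.file.Paths;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class GazetteerLookup {
    public static void main(String[] args) throws Exception {
        // path to the Lucene index built over the GeoNames dump (assumption)
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("geonames-index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // field names "name", "longitude", "latitude" are illustrative assumptions
            TopDocs hits = searcher.search(new TermQuery(new Term("name", "anchorage")), 1);
            if (hits.scoreDocs.length > 0) {
                Document doc = searcher.doc(hits.scoreDocs[0].doc);
                System.out.println(doc.get("name") + " " + doc.get("longitude") + " " + doc.get("latitude"));
            }
        }
    }
}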
If the above explanation does not answer your question, feel free to contact me.

Best,
Yun

 Geo Topic Parser
 

 Key: TIKA-1614
 URL: https://issues.apache.org/jira/browse/TIKA-1614
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Anya Yun Li
  Labels: memex

 ##Description
 This program aims to provide support for identifying geonames in any 
 unstructured text data in the NSF polar research project. 
 https://github.com/NSF-Polar-Cyberinfrastructure/datavis-hackathon/issues/1
 This project is a content-based geotagging solution, made of a variety of NLP 
 tools, and could be used for any geotagging purpose. 
 ##Workflow
 1. Plain text input is passed to the geoparser
 2. Location names are extracted from the text using OpenNLP NER
 3. Locations are given two roles: 
   * The most frequent location name is chosen as the best match for the 
 input text
   * Other extracted locations are treated as (equal) alternatives
 4. For each location extracted above, search for the best GeoName object and 
 return the resolved objects with fields (name in gazetteer, longitude, 
 latitude)
 ##How to Use
 *Caution*: This program requires at least 1.2 GB of disk space for building 
 the Lucene index
 ```Java
 // "geoparser" is an instance of the geo topic parser provided by this patch
 void parse(InputStream stream, String gazetteerPath, String nerPath) throws Exception {
     Metadata metadata = new Metadata();
     ParseContext context = new ParseContext();
     GeoParserConfig config = new GeoParserConfig();
     config.setGazetterPath(gazetteerPath);
     config.setNERModelPath(nerPath);
     context.set(GeoParserConfig.class, config);

     geoparser.parse(
         stream,
         new BodyContentHandler(),
         metadata,
         context);

     // print every metadata field the parser produced
     for (String name : metadata.names()) {
         String value = metadata.get(name);
         System.out.println(name + " " + value);
     }
 }
 ```
 This parser adds useful geographical information to Tika's Metadata 
 object. 
 Fields for best matched location:
 ```
 Geographic_NAME
 Geographic_LONGTITUDE
 Geographic_LATITUDE
 ```
 Fields for alternatives:
 ```
 Geographic_NAME1
 Geographic_LONGTITUDE1
 Geographic_LATITUDE1
 Geographic_NAME2
 Geographic_LONGTITUDE2
 Geographic_LATITUDE2
 ...
 ```
 If you have any questions, contact me: anyayu...@gmail.com



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1614) Geo Topic Parser

2015-04-23 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510365#comment-14510365
 ] 

Nick Burch commented on TIKA-1614:
--

Do we really need to pull in all of Apache Lucene to make this work? Normally 
Lucene users depend on Tika, not the other way around!

There's also a lot of chunky binary data in the patch - any chance you could 
explain what it is, why it's there, how it was generated, how someone could 
make fixes to it etc?

 Geo Topic Parser
 

 Key: TIKA-1614
 URL: https://issues.apache.org/jira/browse/TIKA-1614
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Anya Yun Li
  Labels: memex

 ##Description
 This program aims to provide support for identifying geonames in any 
 unstructured text data in the NSF polar research project. 
 https://github.com/NSF-Polar-Cyberinfrastructure/datavis-hackathon/issues/1
 This project is a content-based geotagging solution, made of a variety of NLP 
 tools, and could be used for any geotagging purpose. 
 ##Workflow
 1. Plain text input is passed to the geoparser
 2. Location names are extracted from the text using OpenNLP NER
 3. Locations are given two roles: 
   * The most frequent location name is chosen as the best match for the 
 input text
   * Other extracted locations are treated as (equal) alternatives
 4. For each location extracted above, search for the best GeoName object and 
 return the resolved objects with fields (name in gazetteer, longitude, 
 latitude)
 ##How to Use
 *Caution*: This program requires at least 1.2 GB of disk space for building 
 the Lucene index
 ```Java
 // "geoparser" is an instance of the geo topic parser provided by this patch
 void parse(InputStream stream, String gazetteerPath, String nerPath) throws Exception {
     Metadata metadata = new Metadata();
     ParseContext context = new ParseContext();
     GeoParserConfig config = new GeoParserConfig();
     config.setGazetterPath(gazetteerPath);
     config.setNERModelPath(nerPath);
     context.set(GeoParserConfig.class, config);

     geoparser.parse(
         stream,
         new BodyContentHandler(),
         metadata,
         context);

     // print every metadata field the parser produced
     for (String name : metadata.names()) {
         String value = metadata.get(name);
         System.out.println(name + " " + value);
     }
 }
 ```
 This parser adds useful geographical information to Tika's Metadata 
 object. 
 Fields for best matched location:
 ```
 Geographic_NAME
 Geographic_LONGTITUDE
 Geographic_LATITUDE
 ```
 Fields for alternatives:
 ```
 Geographic_NAME1
 Geographic_LONGTITUDE1
 Geographic_LATITUDE1
 Geographic_NAME2
 Geographic_LONGTITUDE2
 Geographic_LATITUDE2
 ...
 ```
 If you have any questions, contact me: anyayu...@gmail.com



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1598) Parser Implementation for Streaming Video

2015-04-23 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510371#comment-14510371
 ] 

Nick Burch commented on TIKA-1598:
--

[~rgauss] already maintains support for wrapping FFMpeg for use in Tika at 
https://github.com/AlfrescoLabs/tika-ffmpeg based on the ExternalParser support 
- is it possible to re-use / extend that for this additional use-case?
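
For anyone who has not looked at it, the ExternalParser route boils down to 
pointing Tika at the ffmpeg binary and scraping its output; a rough sketch 
along those lines (the command-line flags and the metadata pattern are 
assumptions, not the tika-ffmpeg code):

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.external.ExternalParser;

public class FfmpegWrapperSketch {
    public static ExternalParser build() {
        ExternalParser parser = new ExternalParser();
        parser.setSupportedTypes(Collections.singleton(MediaType.video("mp4")));
        // ffmpeg prints stream details when given just an input file (assumed invocation)
        parser.setCommand("ffmpeg", "-i", ExternalParser.INPUT_FILE_TOKEN);
        // pull one metadata value out of ffmpeg's output; the pattern is illustrative only
        Map<Pattern, String> patterns = new HashMap<Pattern, String>();
        patterns.put(Pattern.compile("Duration: ([^,]+),"), "duration");
        parser.setMetadataExtractionPatterns(patterns);
        return parser;
    }
}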

 Parser Implementation for Streaming Video
 -

 Key: TIKA-1598
 URL: https://issues.apache.org/jira/browse/TIKA-1598
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
  Labels: memex
 Fix For: 1.9


 A number of us have been discussing a Tika implementation which could, for 
 example, bind to a live multimedia stream and parse content from the stream 
 until it finishes.
 An excellent example would be watching Bonnie Scotland beating R. of Ireland 
 in the upcoming European Championship Qualifying - Group D on Sat 13 Jun @ 
 17:00 GMT :)
 I located a JMF Wrapper for ffmpeg which 'may' enable us to do this
 http://sourceforge.net/projects/jffmpeg/
 I am not sure... plus it is not licensed liberally enough for us to include, 
 so if there are other implementations then please post them here.
 I 'may' be able to have a crack at implementing this next week.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1610) CBOR Parser and detection [improvement]

2015-04-23 Thread Luke sh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510402#comment-14510402
 ] 

Luke sh commented on TIKA-1610:
---

Thanks a lot [~gagravarr] for the prompt response.
I thought it would probably be risky if we discarded any one of the estimated 
types just because of the magic priority (one being higher than the other); I 
wanted Tika to rely on the extension when there is a tie to break.

For now, in this particular case, I also cannot think of any reason why we 
shouldn't use 60; maybe I am just too skeptical.

Thanks


 CBOR Parser and detection [improvement]
 ---

 Key: TIKA-1610
 URL: https://issues.apache.org/jira/browse/TIKA-1610
 Project: Tika
  Issue Type: New Feature
  Components: detector, mime, parser
Affects Versions: 1.7
Reporter: Luke sh
Assignee: Chris A. Mattmann
Priority: Trivial
  Labels: memex
 Attachments: 142440269.html, NUTCH-1997.cbor, 
 cbor_tika.mimetypes.xml.jpg, rfc_cbor.jpg


 CBOR is a data format whose design goals include the possibility of extremely 
 small code size, fairly small message size, and extensibility without the 
 need for version negotiation (cited from http://cbor.io/ ).
 It would be great if Tika were able to provide support for CBOR parsing and 
 identification. In the current project with Nutch, the Nutch 
 CommonCrawlDataDumper is used to dump the crawled segments to files in CBOR 
 format. In order to read/parse the files dumped by this tool, it would be 
 great if Tika could parse cbor. The thing is that the CommonCrawlDataDumper 
 does not dump with the correct extension; it dumps by its own rule, and the 
 default extension of a dumped file is html, so it would be less painful if 
 Tika were able to detect and parse those files without any pre-processing 
 steps. 
 CommonCrawlDataDumper calls the following to dump with cbor:
 import com.fasterxml.jackson.dataformat.cbor.CBORFactory;
 import com.fasterxml.jackson.dataformat.cbor.CBORGenerator;
 fasterxml is a 3rd-party library for converting json to .cbor and vice versa.
 According to RFC 7049 (http://tools.ietf.org/html/rfc7049), it looks like 
 CBOR does not yet have magic numbers that other applications can use to 
 detect/identify it (PFA: rfc_cbor.jpg).
 It seems that the only way to inform other applications of the type as of now 
 is the extension (i.e. .cbor), or perhaps content detection (e.g. byte 
 histogram distribution estimation).
 Another thing worth attention: it looks like Tika has already attempted to 
 add support for cbor mime detection in tika-mimetypes.xml 
 (PFA: cbor_tika.mimetypes.xml.jpg); this detection does not work with the 
 cbor file dumped by CommonCrawlDataDumper.
 According to http://tools.ietf.org/html/rfc7049#section-2.4.5, there is a 
 self-describing tag 55799 that seems to be usable for cbor type 
 identification (its hex encoding is 0xd9d9f7), but it is probably up to the 
 application to take care of this tag, and it is also possible that the 
 fasterxml library used by the Nutch dumping tool does not write this tag. An 
 example cbor file dumped by the Nutch tool, i.e. CommonCrawlDataDumper, has 
 also been attached (PFA: 142440269.html).
 The following is cited from the RFC: "...a decoder might be able to parse 
 both CBOR and JSON. Such a decoder would need to mechanically distinguish the 
 two formats. An easy way for an encoder to help the decoder would be to tag 
 the entire CBOR item with tag 55799, the serialization of which will never be 
 found at the beginning of a JSON text..."
 It looks like a file can have two parts/sections, i.e. the plain text parts 
 and the json prettified by cbor; this might also be worth attention and 
 consideration for the parsing and type identification.
 On the other hand, it is worth noting that the entry for cbor extension 
 detection needs to be added to tika-mimetypes.xml too, e.g.
 <glob pattern="*.cbor"/>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-1610) CBOR Parser and detection [improvement]

2015-04-23 Thread Luke sh (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke sh updated TIKA-1610:
--
Attachment: NUTCH-1997.cbor

 CBOR Parser and detection [improvement]
 ---

 Key: TIKA-1610
 URL: https://issues.apache.org/jira/browse/TIKA-1610
 Project: Tika
  Issue Type: New Feature
  Components: detector, mime, parser
Affects Versions: 1.7
Reporter: Luke sh
Assignee: Chris A. Mattmann
Priority: Trivial
  Labels: memex
 Attachments: 142440269.html, NUTCH-1997.cbor, 
 cbor_tika.mimetypes.xml.jpg, rfc_cbor.jpg


 CBOR is a data format whose design goals include the possibility of extremely 
 small code size, fairly small message size, and extensibility without the 
 need for version negotiation (cited from http://cbor.io/ ).
 It would be great if Tika were able to provide support for CBOR parsing and 
 identification. In the current project with Nutch, the Nutch 
 CommonCrawlDataDumper is used to dump the crawled segments to files in CBOR 
 format. In order to read/parse the files dumped by this tool, it would be 
 great if Tika could parse cbor. The thing is that the CommonCrawlDataDumper 
 does not dump with the correct extension; it dumps by its own rule, and the 
 default extension of a dumped file is html, so it would be less painful if 
 Tika were able to detect and parse those files without any pre-processing 
 steps. 
 CommonCrawlDataDumper calls the following to dump with cbor:
 import com.fasterxml.jackson.dataformat.cbor.CBORFactory;
 import com.fasterxml.jackson.dataformat.cbor.CBORGenerator;
 fasterxml is a 3rd-party library for converting json to .cbor and vice versa.
 According to RFC 7049 (http://tools.ietf.org/html/rfc7049), it looks like 
 CBOR does not yet have magic numbers that other applications can use to 
 detect/identify it (PFA: rfc_cbor.jpg).
 It seems that the only way to inform other applications of the type as of now 
 is the extension (i.e. .cbor), or perhaps content detection (e.g. byte 
 histogram distribution estimation).
 Another thing worth attention: it looks like Tika has already attempted to 
 add support for cbor mime detection in tika-mimetypes.xml 
 (PFA: cbor_tika.mimetypes.xml.jpg); this detection does not work with the 
 cbor file dumped by CommonCrawlDataDumper.
 According to http://tools.ietf.org/html/rfc7049#section-2.4.5, there is a 
 self-describing tag 55799 that seems to be usable for cbor type 
 identification (its hex encoding is 0xd9d9f7), but it is probably up to the 
 application to take care of this tag, and it is also possible that the 
 fasterxml library used by the Nutch dumping tool does not write this tag. An 
 example cbor file dumped by the Nutch tool, i.e. CommonCrawlDataDumper, has 
 also been attached (PFA: 142440269.html).
 The following is cited from the RFC: "...a decoder might be able to parse 
 both CBOR and JSON. Such a decoder would need to mechanically distinguish the 
 two formats. An easy way for an encoder to help the decoder would be to tag 
 the entire CBOR item with tag 55799, the serialization of which will never be 
 found at the beginning of a JSON text..."
 It looks like a file can have two parts/sections, i.e. the plain text parts 
 and the json prettified by cbor; this might also be worth attention and 
 consideration for the parsing and type identification.
 On the other hand, it is worth noting that the entry for cbor extension 
 detection needs to be added to tika-mimetypes.xml too, e.g.
 <glob pattern="*.cbor"/>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (TIKA-1610) CBOR Parser and detection [improvement]

2015-04-23 Thread Luke sh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510382#comment-14510382
 ] 

Luke sh edited comment on TIKA-1610 at 4/24/15 2:43 AM:


Notes:
The attached cbor file (i.e. NUTCH-1997.cbor) contains magic bytes for both 
the xhtml type and the cbor type; with priority 40 on application/cbor, we 
will have the following issues.

Problem 1: magic priority 40.
application/xhtml+xml has a higher priority (50) than application/cbor (40) 
[I don't know who assigned 40 to cbor, or why]. So if xhtml gets read and 
compared first, cbor will not even be placed in the magic estimation list 
because of its lower priority. Based on the tests, xhtml is indeed read and 
compared against the input file first, so any type below priority 50 is 
disregarded.

Problem 2: magic priority again, this time with both at 50.
In Tika, given a file dumped by the Nutch dumper tool, both types (xhtml and 
cbor) will be selected as candidate mime types and put in the magic estimation 
list; since the xhtml type gets read first, it is placed above cbor. To break 
that tie, Tika relies on the decision from the extension method. If the 
extension method fails to detect the type (for now, let's ignore the metadata 
hint method for simplicity, but the same applies to it), then xhtml is 
returned eventually.

My pull request to be sent: I am going to set the magic priority of the cbor 
type to 50, the same as xhtml, because it would probably be risky to discard 
any of the estimated types without consulting the extension method.



was (Author: lukeliush):
Notes:
The attached cbor file contains magic bytes for both the xhtml type and the 
cbor type; with priority 40 on application/cbor, we will have the following 
issues.

Problem 1: magic priority 40.
application/xhtml+xml has a higher priority (50) than application/cbor (40) 
[I don't know who assigned 40 to cbor, or why]. So if xhtml gets read and 
compared first, cbor will not even be placed in the magic estimation list 
because of its lower priority. Based on the tests, xhtml is indeed read and 
compared against the input file first, so any type below priority 50 is 
disregarded.

Problem 2: magic priority again, this time with both at 50.
In Tika, given a file dumped by the Nutch dumper tool, both types (xhtml and 
cbor) will be selected as candidate mime types and put in the magic estimation 
list; since the xhtml type gets read first, it is placed above cbor. To break 
that tie, Tika relies on the decision from the extension method. If the 
extension method fails to detect the type (for now, let's ignore the metadata 
hint method for simplicity, but the same applies to it), then xhtml is 
returned eventually.

My pull request to be sent: I am going to set the magic priority of the cbor 
type to 50, the same as xhtml, because it would probably be risky to discard 
any of the estimated types without consulting the extension method.


 CBOR Parser and detection [improvement]
 ---

 Key: TIKA-1610
 URL: https://issues.apache.org/jira/browse/TIKA-1610
 Project: Tika
  Issue Type: New Feature
  Components: detector, mime, parser
Affects Versions: 1.7
Reporter: Luke sh
Assignee: Chris A. Mattmann
Priority: Trivial
  Labels: memex
 Attachments: 142440269.html, NUTCH-1997.cbor, 
 cbor_tika.mimetypes.xml.jpg, rfc_cbor.jpg


 CBOR is a data format whose design goals include the possibility of extremely 
 small code size, fairly small message size, and extensibility without the 
 need for version negotiation (cited from http://cbor.io/ ).
 It would be great if Tika were able to provide support for CBOR parsing and 
 identification. In the current project with Nutch, the Nutch 
 CommonCrawlDataDumper is used to dump the crawled segments to files in CBOR 
 format. In order to read/parse the files dumped by this tool, it would be 
 great if Tika could parse cbor. The thing is that the CommonCrawlDataDumper 
 does not dump with the correct extension; it dumps by its own rule, and the 
 default extension of a dumped file is html, so it would be less painful if 
 Tika were able to detect and parse those files without any pre-processing 
 steps. 
 CommonCrawlDataDumper calls the following to dump with cbor:
 import com.fasterxml.jackson.dataformat.cbor.CBORFactory;
 import com.fasterxml.jackson.dataformat.cbor.CBORGenerator;
 fasterxml is a 3rd-party library for converting json to .cbor and vice versa.
 According to RFC 7049 (http://tools.ietf.org/html/rfc7049), it looks like 
 CBOR does not yet have magic numbers that other applications can use to 
 detect/identify it (PFA: rfc_cbor.jpg).
 It seems that the only way to inform other applications of the type as of now 
 is the extension (i.e. .cbor), or perhaps content detection (e.g. byte 
 histogram distribution estimation).

[jira] [Commented] (TIKA-1610) CBOR Parser and detection [improvement]

2015-04-23 Thread Luke sh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510382#comment-14510382
 ] 

Luke sh commented on TIKA-1610:
---

Notes:
The attached cbor file contains magic bytes for both the xhtml type and the 
cbor type; with priority 40 on application/cbor, we will have the following 
issues.

Problem 1: magic priority 40.
application/xhtml+xml has a higher priority (50) than application/cbor (40) 
[I don't know who assigned 40 to cbor, or why]. So if xhtml gets read and 
compared first, cbor will not even be placed in the magic estimation list 
because of its lower priority. Based on the tests, xhtml is indeed read and 
compared against the input file first, so any type below priority 50 is 
disregarded.

Problem 2: magic priority again, this time with both at 50.
In Tika, given a file dumped by the Nutch dumper tool, both types (xhtml and 
cbor) will be selected as candidate mime types and put in the magic estimation 
list; since the xhtml type gets read first, it is placed above cbor. To break 
that tie, Tika relies on the decision from the extension method. If the 
extension method fails to detect the type (for now, let's ignore the metadata 
hint method for simplicity, but the same applies to it), then xhtml is 
returned eventually.

My pull request to be sent: I am going to set the magic priority of the cbor 
type to 50, the same as xhtml, because it would probably be risky to discard 
any of the estimated types without consulting the extension method.


 CBOR Parser and detection [improvement]
 ---

 Key: TIKA-1610
 URL: https://issues.apache.org/jira/browse/TIKA-1610
 Project: Tika
  Issue Type: New Feature
  Components: detector, mime, parser
Affects Versions: 1.7
Reporter: Luke sh
Assignee: Chris A. Mattmann
Priority: Trivial
  Labels: memex
 Attachments: 142440269.html, NUTCH-1997.cbor, 
 cbor_tika.mimetypes.xml.jpg, rfc_cbor.jpg


 CBOR is a data format whose design goals include the possibility of extremely 
 small code size, fairly small message size, and extensibility without the 
 need for version negotiation (cited from http://cbor.io/ ).
 It would be great if Tika were able to provide support for CBOR parsing and 
 identification. In the current project with Nutch, the Nutch 
 CommonCrawlDataDumper is used to dump the crawled segments to files in CBOR 
 format. In order to read/parse the files dumped by this tool, it would be 
 great if Tika could parse cbor. The thing is that the CommonCrawlDataDumper 
 does not dump with the correct extension; it dumps by its own rule, and the 
 default extension of a dumped file is html, so it would be less painful if 
 Tika were able to detect and parse those files without any pre-processing 
 steps. 
 CommonCrawlDataDumper calls the following to dump with cbor:
 import com.fasterxml.jackson.dataformat.cbor.CBORFactory;
 import com.fasterxml.jackson.dataformat.cbor.CBORGenerator;
 fasterxml is a 3rd-party library for converting json to .cbor and vice versa.
 According to RFC 7049 (http://tools.ietf.org/html/rfc7049), it looks like 
 CBOR does not yet have magic numbers that other applications can use to 
 detect/identify it (PFA: rfc_cbor.jpg).
 It seems that the only way to inform other applications of the type as of now 
 is the extension (i.e. .cbor), or perhaps content detection (e.g. byte 
 histogram distribution estimation).
 Another thing worth attention: it looks like Tika has already attempted to 
 add support for cbor mime detection in tika-mimetypes.xml 
 (PFA: cbor_tika.mimetypes.xml.jpg); this detection does not work with the 
 cbor file dumped by CommonCrawlDataDumper.
 According to http://tools.ietf.org/html/rfc7049#section-2.4.5, there is a 
 self-describing tag 55799 that seems to be usable for cbor type 
 identification (its hex encoding is 0xd9d9f7), but it is probably up to the 
 application to take care of this tag, and it is also possible that the 
 fasterxml library used by the Nutch dumping tool does not write this tag. An 
 example cbor file dumped by the Nutch tool, i.e. CommonCrawlDataDumper, has 
 also been attached (PFA: 142440269.html).
 The following is cited from the RFC: "...a decoder might be able to parse 
 both CBOR and JSON. Such a decoder would need to mechanically distinguish the 
 two formats. An easy way for an encoder to help the decoder would be to tag 
 the entire CBOR item with tag 55799, the serialization of which will never be 
 found at the beginning of a JSON text..."
 It looks like a file can have two parts/sections, i.e. the plain text parts 
 and the json prettified by cbor; this might also be worth attention and 
 consideration for the parsing and type identification.

[jira] [Commented] (TIKA-1610) CBOR Parser and detection [improvement]

2015-04-23 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510394#comment-14510394
 ] 

Nick Burch commented on TIKA-1610:
--

Based on that, I think the CBOR mime magic needs to be higher than the (x)html 
one, not lower and not the same. So, in r1675755, I've set it to 60 and added 
detection unit tests. These tests failed before the bump from 40 to 60, so I 
think we're in a better place now!
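
For anyone following along, the shape of such a detection test is roughly this 
(a sketch using the Tika facade rather than the exact TestMimeTypes helpers; 
it assumes the NUTCH-1997.cbor attachment is on the test classpath):

import static org.junit.Assert.assertEquals;
import java.io.BufferedInputStream;
import java.io.InputStream;
import org.apache.tika.Tika;
import org.junit.Test;

public class CborDetectionTest {
    @Test
    public void cborMagicWinsOverXhtml() throws Exception {
        Tika tika = new Tika();
        try (InputStream in = new BufferedInputStream(
                getClass().getResourceAsStream("/test-documents/NUTCH-1997.cbor"))) {
            // with the cbor magic at priority 60 this should no longer fall back to (x)html
            assertEquals("application/cbor", tika.detect(in));
        }
    }
}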

 CBOR Parser and detection [improvement]
 ---

 Key: TIKA-1610
 URL: https://issues.apache.org/jira/browse/TIKA-1610
 Project: Tika
  Issue Type: New Feature
  Components: detector, mime, parser
Affects Versions: 1.7
Reporter: Luke sh
Assignee: Chris A. Mattmann
Priority: Trivial
  Labels: memex
 Attachments: 142440269.html, NUTCH-1997.cbor, 
 cbor_tika.mimetypes.xml.jpg, rfc_cbor.jpg


 CBOR is a data format whose design goals include the possibility of extremely 
 small code size, fairly small message size, and extensibility without the 
 need for version negotiation (cited from http://cbor.io/ ).
 It would be great if Tika is able to provide the support with CBOR parser and 
 identification. In the current project with Nutch, the Nutch 
 CommonCrawlDataDumper is used to dump the crawled segments to the files in 
 the format of CBOR. In order to read/parse those dumped files by this tool, 
 it would be great if tika is able to support parsing the cbor, the thing is 
 that the CommonCrawlDataDumper is not dumping with correct extension, it 
 dumps with its own rule, the default extension of the dumped file is html, so 
 it might be less painful if tika is able to detect and parse those files 
 without any pre-processing steps. 
 CommonCrawlDataDumper is calling the following to dump with cbor.
 import com.fasterxml.jackson.dataformat.cbor.CBORFactory;
 import com.fasterxml.jackson.dataformat.cbor.CBORGenerator;
 fasterxml is a 3rd party library for converting json to .cbor and Vice Versa.
 According to RFC 7049 (http://tools.ietf.org/html/rfc7049), it looks like 
 CBOR does not yet have its magic numbers to be detected/identified by other 
 applications (PFA: rfc_cbor.jpg)
 It seems that the only way to inform other applications of the type as of now 
 is using the extension (i.e. .cbor), or probably content detection (i.e. byte 
 histogram distribution estimation).  
 There is another thing worth the attention, it looks like tika has attempted 
 to add the support with cbor mime detection in the tika-mimetypes.xml 
 (PFA:cbor_tika.mimetypes.xml.jpg); This detection is not working with the 
 cbor file dumped by CommonCrawlDataDumper. 
 According to http://tools.ietf.org/html/rfc7049#section-2.4.5, there is a 
 self-describing Tag 55799 that seems to be used for cbor type 
 identification(the hex code might be 0xd9d9f7), but it is probably up to the 
 application that take care of this tag, and it is also possible that the 
 fasterxml that the nutch dumping tool is missing this tag, an example cbor 
 file dumped by the Nutch tool i.e. CommonCrawlDataDumper has also been 
 attached (PFA: 142440269.html).
 The following info is cited from the rfc, ...a decoder might be able to 
 parse both CBOR and JSON.
Such a decoder would need to mechanically distinguish the two
formats.  An easy way for an encoder to help the decoder would be to
tag the entire CBOR item with tag 55799, the serialization of which
will never be found at the beginning of a JSON text...
 It looks like the a file can have two parts/sections i.e. the plain text 
 parts and the json prettified by cbor, this might be also worth the attention 
 and consideration with the parsing and type identification.
 On the other hand, it is worth noting that the entries for cbor extension 
 detection needs to be appended in the tika-mimetypes.xml too 
 e.g.
 glob pattern=*.cbor/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TIKA-1617) Change OSGi Detection test to use OSGi Service

2015-04-23 Thread Bob Paulin (JIRA)
Bob Paulin created TIKA-1617:


 Summary: Change OSGi Detection test to use OSGi Service
 Key: TIKA-1617
 URL: https://issues.apache.org/jira/browse/TIKA-1617
 Project: Tika
  Issue Type: Test
Reporter: Bob Paulin
Priority: Minor


Currently the testDetection test does not actually use the OSGi service created 
within the OSGi Framework.  I've changed the test to use the service defined in 
the tika-bundle
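
Purely as an illustration of the idea (not the attached patch itself), a 
detection test that goes through the OSGi service registry rather than 
instantiating a detector directly might look roughly like this, assuming a 
BundleContext injected by the OSGi test framework:

import java.io.InputStream;
import org.apache.tika.detect.Detector;
import org.apache.tika.metadata.Metadata;
import org.osgi.framework.BundleContext;
import org.osgi.framework.ServiceReference;

public class OsgiDetectionSketch {
    // bundleContext would be injected by the test framework (e.g. pax-exam @Inject)
    static String detectViaService(BundleContext bundleContext, InputStream stream) throws Exception {
        ServiceReference<Detector> ref = bundleContext.getServiceReference(Detector.class);
        Detector detector = bundleContext.getService(ref);
        try {
            // the stream should support mark/reset, e.g. a BufferedInputStream
            return detector.detect(stream, new Metadata()).toString();
        } finally {
            bundleContext.ungetService(ref); // release the service once done
        }
    }
}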



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)