[jira] [Commented] (TIKA-1533) PDF parse failing to capture right order of text (2 columns)

2015-01-28 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295159#comment-14295159
 ] 

Tim Allison commented on TIKA-1533:
---

In the first document, does printed page 303 (PDF page 152) contain Tabell 5.7 
through Tabell 5.9? I only see "362" on printed page 362 and in "sammanlagt 362 
frågor" on printed page 88 (PDF page 45).

Have you run PDFBox's ExtractText app directly to see whether it has the same 
issue as Tika?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TIKA-1533) PDF parse failing to capture right order of text (2 columns)

2015-01-28 Thread Tamara (JIRA)
Tamara created TIKA-1533:


 Summary: PDF parse failing to capture right order of text (2 
columns)
 Key: TIKA-1533
 URL: https://issues.apache.org/jira/browse/TIKA-1533
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.7, 1.6
 Environment: Java 8, Mac OS X
Reporter: Tamara


When I convert a document with two columns, the order of the columns is 
inverted in the text file. I only noticed because the page is an index list. 
The problem starts on page 303; to find it in the converted text, search for 
"362". The second file has the same problem on page 341.

I tried setSortByPosition(true), and the columns got scrambled.

I tried copying and pasting from the PDF preview, and the pasted text is in 
the correct order.

I also tried PDFXStream, and it parses the text in the correct order.

Here are the files in which I have seen the issue:
http://www.sbu.se/upload/Publikationer/Content0/1/Autismspektrumtillst%C3%A5nd_fulltext.pdf

http://www.sbu.se/upload/publikationer/content0/1/forstamningssyndrom_fulltext.pdf
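
One plausible reason a pure position sort scrambles a two-column page is that 
it orders text top to bottom and then left to right, so lines from the two 
columns interleave. The sketch below illustrates that effect in plain Java. It 
does not use PDFBox or Tika: the Chunk type, the coordinates, and the gutter 
position are hypothetical and only serve the illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a positioned piece of text; not a PDFBox type.
final class Chunk {
    final String text;
    final double x; // distance from the left edge of the page
    final double y; // distance from the top edge of the page

    Chunk(String text, double x, double y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

final class ColumnOrderDemo {

    // Pure position sort: top to bottom, then left to right.
    // On a two-column page this interleaves the columns line by line.
    static List<String> sortByPosition(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> c.y)
                .thenComparingDouble(c -> c.x));
        return texts(sorted);
    }

    // Column-aware order: everything left of the gutter first (top to
    // bottom), then everything right of it.
    static List<String> readByColumns(List<Chunk> chunks, double gutterX) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingInt((Chunk c) -> c.x < gutterX ? 0 : 1)
                .thenComparingDouble(c -> c.y));
        return texts(sorted);
    }

    private static List<String> texts(List<Chunk> chunks) {
        List<String> out = new ArrayList<>();
        for (Chunk c : chunks) {
            out.add(c.text);
        }
        return out;
    }

    // Two lines in a left column (A1, A2) and two in a right column (B1, B2).
    static List<Chunk> samplePage() {
        List<Chunk> page = new ArrayList<>();
        page.add(new Chunk("B1", 300, 10));
        page.add(new Chunk("A1", 50, 10));
        page.add(new Chunk("A2", 50, 20));
        page.add(new Chunk("B2", 300, 20));
        return page;
    }

    public static void main(String[] args) {
        System.out.println(sortByPosition(samplePage()));     // [A1, B1, A2, B2]
        System.out.println(readByColumns(samplePage(), 200)); // [A1, A2, B1, B2]
    }
}
```

Whether this is what happens for the two files above would need to be checked 
against the actual extraction output.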







[jira] [Commented] (TIKA-1517) MIME type selection with probability

2015-01-28 Thread Luke sh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295928#comment-14295928
 ] 

Luke sh commented on TIKA-1517:
---

The probability selection will extend the class MimeTypes, which needs to be 
modified to expose some of its methods so that they can be inherited and 
overridden (TIKA-1535).

 MIME type selection with probability
 

 Key: TIKA-1517
 URL: https://issues.apache.org/jira/browse/TIKA-1517
 Project: Tika
  Issue Type: Improvement
  Components: mime
Affects Versions: 0.1-incubating, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 
 0.10, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6
Reporter: Luke sh
 Attachments: BaysianTest.java


 Improvement and intuition
 The original implementation of MIME type selection/detection is somewhat 
 inflexible by design, as it relies heavily on the outcome of magic-bytes 
 MIME type identification: if magic bytes are applicable to a file, Tika 
 follows the file type they indicate. It may be better to give users more 
 control over which method is used.
 The proposed approach loosely applies Bayes' theorem: users assign a weight, 
 expressed as a probability, to each of the MIME type identification methods 
 available in Tika. There are currently 3 such methods (magic bytes, file 
 extension, and the metadata content-type hint). With these weights, users 
 can choose which method they trust most, although the magic-bytes method is 
 usually trustworthy. The benefit shows in situations where file type 
 identification is sensitive: some users might want all of the identification 
 methods to agree on the same file type before they start processing a file, 
 because incorrect file type identification is less tolerable there. The 
 current implementation is not flexible enough for this purpose and relies 
 heavily on the magic-bytes method (although magic bytes are the most 
 reliable of the 3).
 Proposed design:
 The idea is to attach a probability, as a weight, to each MIME type 
 identification method currently implemented in Tika (magic bytes, file 
 extension match, and metadata content-type hint).
 For example, as a user, I would like to assign a preference to each method 
 based on how much I trust it, and to order the results when the methods do 
 not agree.
 Bayes' rule seems appropriate for capturing this intuition.
 The following are needed to implement Bayes' rule:
  Prior probability P(file_type), e.g. P(pdf). In theory this is computed 
  from samples, and it depends on the domain or use case. Intuitively, we 
  care more about the ordering of the resulting probabilities than about the 
  actual numbers. For example, if we happen to crawl a website that contains 
  mostly PDF files, we can collect some samples and compute the prior; if the 
  samples say that 90% of the documents are PDF, the prior is P(pdf) = 0.9. 
  We propose to make the prior a user-configurable parameter and, by default, 
  to leave it not applicable. Alternatively, we could define the prior for 
  each file type as 1/[number of supported file types in Tika], which I think 
  would be approximately 1/1157. That seems fairer, but the reason to avoid 
  it is that the prior would then be the same for every type. We ultimately 
  care about the ordering of the results, and a constant prior cannot change 
  that ordering, so bringing 1/1157 into the Bayesian equation would only 
  burden the implementation with extra computation. We therefore leave the 
  prior as not applicable, which means we assign it 1, as if it did not 
  exist. Because the parameter is configurable, we believe it provides useful 
  flexibility in some use cases.
  Conditional probability of a positive test given a file type, 
  P(test | file_type), e.g. P(test1 = pdf | pdf). This probability also 
  depends on collected samples and on the domain or use case, so we leave it 
  configurable. Based on our intuition, test1 (i.e. the magic-bytes method) 
  is the most trustworthy, so the default value of 
  P(test1 = a_file_type | a_file_type) is 0.75. That is, given a file whose 
  type is a_file_type, the probability of test1 predicting that the file is 
  a_file_type is 

[jira] [Commented] (TIKA-1535) Inheritance modification for the class MIMETypes

2015-01-28 Thread Luke sh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295922#comment-14295922
 ] 

Luke sh commented on TIKA-1535:
---

The MIME type selection mechanism with probability (TIKA-1517) will be 
implemented by extending the class MimeTypes. MimeTypes is currently declared 
final, and some of its methods are private, which prevents overriding.



 Inheritance modification for the class MIMETypes
 

 Key: TIKA-1535
 URL: https://issues.apache.org/jira/browse/TIKA-1535
 Project: Tika
  Issue Type: Improvement
  Components: mime
Reporter: Luke sh
Priority: Trivial

 The class MimeTypes does not currently allow inheritance.
 A couple of methods in this class look independent, and some of them need to 
 be exposed or made overridable for special needs or use cases. This would 
 give Tika users more flexibility to implement new MIME detection algorithms.





[jira] [Commented] (TIKA-1521) Handle password protected 7zip files

2015-01-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295791#comment-14295791
 ] 

Hudson commented on TIKA-1521:
--

SUCCESS: Integrated in tika-trunk-jdk1.6 #442 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.6/442/])
TIKA-1521: follow commons-compress and require installation of jce before 
testing password on 7z file (tallison: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1655431)
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pkg/PackageParser.java
* 
/tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/pkg/Seven7ParserTest.java


 Handle password protected 7zip files
 

 Key: TIKA-1521
 URL: https://issues.apache.org/jira/browse/TIKA-1521
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.7
Reporter: Nick Burch
 Fix For: 1.8


 While working on TIKA-1028, I noticed that although Commons Compress doesn't 
 currently handle decrypting password-protected zip files, it does handle 
 password-protected 7zip files.
 We should therefore add logic to the package parser to spot 
 password-protected 7zip files and fetch the password for them from a 
 PasswordProvider, if one is given.





[jira] [Commented] (TIKA-1534) Upgrade to Commons Compress 1.9

2015-01-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295789#comment-14295789
 ] 

Hudson commented on TIKA-1534:
--

SUCCESS: Integrated in tika-trunk-jdk1.6 #442 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.6/442/])
TIKA-1534: Upgrade to Commons Compress 1.9 (tallison: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1655433)
* /tika/trunk/CHANGES.txt
* /tika/trunk/tika-parsers/pom.xml


 Upgrade to Commons Compress 1.9
 ---

 Key: TIKA-1534
 URL: https://issues.apache.org/jira/browse/TIKA-1534
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Priority: Minor







[jira] [Commented] (TIKA-1517) MIME type selection with probability

2015-01-28 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296103#comment-14296103
 ] 

Tyler Palsulich commented on TIKA-1517:
---

Hi [~Lukeliush]. Thanks for raising this idea! Have you tested this probability 
selection mechanism on a set of files? Did the Mime Type detection improve?

Instead of extending MimeTypes, you might be able to implement this as a 
standalone Detector which has an instance of MimeTypes it uses to get the Mime 
Type hints needed for your Bayesian system. Does that make sense?
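
A minimal sketch of that composition idea, in plain Java without Tika on the 
classpath. HintSource and SimpleDetector are hypothetical stand-ins (for a 
final class such as MimeTypes and for a detector contract); they are not 
Tika's API, and the hint logic is deliberately trivial. The wrapper holds an 
instance instead of extending it, and here it applies a strict rule: report a 
type only when all hints agree.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Objects;

// Hypothetical stand-in for a final class such as MimeTypes: it cannot be
// extended, but another class can hold an instance and call its methods.
final class HintSource {
    static final String UNKNOWN = "application/octet-stream";

    // Hint from the leading bytes of the content.
    String byMagic(byte[] prefix) {
        boolean pdf = prefix != null && prefix.length >= 4
                && prefix[0] == '%' && prefix[1] == 'P'
                && prefix[2] == 'D' && prefix[3] == 'F';
        return pdf ? "application/pdf" : UNKNOWN;
    }

    // Hint from the file name extension.
    String byName(String fileName) {
        boolean pdf = fileName != null
                && fileName.toLowerCase(Locale.ROOT).endsWith(".pdf");
        return pdf ? "application/pdf" : UNKNOWN;
    }
}

// Hypothetical, simplified detector contract (not Tika's Detector interface).
interface SimpleDetector {
    String detect(byte[] prefix, String fileName, String declaredType);
}

// Standalone detector that wraps a HintSource instead of extending it.
final class AgreementDetector implements SimpleDetector {
    private final HintSource hints;

    AgreementDetector(HintSource hints) {
        this.hints = Objects.requireNonNull(hints);
    }

    // Returns a type only when all three hints agree; otherwise UNKNOWN.
    @Override
    public String detect(byte[] prefix, String fileName, String declaredType) {
        List<String> all = Arrays.asList(
                hints.byMagic(prefix), hints.byName(fileName), declaredType);
        String first = all.get(0);
        for (String hint : all) {
            if (!first.equals(hint)) {
                return HintSource.UNKNOWN;
            }
        }
        return first;
    }

    public static void main(String[] args) {
        AgreementDetector detector = new AgreementDetector(new HintSource());
        byte[] pdf = {'%', 'P', 'D', 'F', '-', '1', '.', '4'};
        System.out.println(detector.detect(pdf, "a.pdf", "application/pdf"));
        System.out.println(detector.detect(pdf, "a.txt", "application/pdf"));
    }
}
```

The agreement rule could be swapped for a weighted one without touching the 
wrapped class, which is the point of wrapping rather than extending.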


[jira] [Comment Edited] (TIKA-1517) MIME type selection with probability

2015-01-28 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296103#comment-14296103
 ] 

Tyler Palsulich edited comment on TIKA-1517 at 1/29/15 12:04 AM:
-

Hi [~Lukeliush]. Thanks for raising this idea! Have you tested this probability 
selection mechanism on a set of files? Did the Mime Type detection improve?

Instead of extending MimeTypes, you might be able to implement this as a 
standalone Detector. Your Detector could use an instance of MimeTypes to get 
the Mime Type hints needed for your Bayesian system. Does that make sense?


was (Author: tpalsulich):
Hi [~Lukeliush]. Thanks for raising this idea! Have you tested this probability 
selection mechanism on a set of files? Did the Mime Type detection improve?

Instead of extending MimeTypes, you might be able to implement this as a 
standalone Detector which has an instance of MimeTypes it uses to get the Mime 
Type hints needed for your Bayesian system. Does that make sense?


[jira] [Commented] (TIKA-1423) Build a parser to extract data from GRIB formats

2015-01-28 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296129#comment-14296129
 ] 

Lewis John McGibbney commented on TIKA-1423:


I am working on this and think I have navigated the osgi + bundle problem we 
were having.
I've just released edu.ucar thredds-parent, cdm, netCDF4 and grib 4.5.4 
dependencies to Maven Central so I will try to complete this ticket again once 
the dependencies have flushed to the mirrors.


 Build a parser to extract data from GRIB formats
 

 Key: TIKA-1423
 URL: https://issues.apache.org/jira/browse/TIKA-1423
 Project: Tika
  Issue Type: New Feature
  Components: metadata, mime, parser
Affects Versions: 1.6
Reporter: Vineet Ghatge
Assignee: Vineet Ghatge
Priority: Critical
  Labels: features, newbie
 Fix For: 1.8

 Attachments: GRIBParsertest.java, GribParser.java, 
 NLDAS_FORA0125_H.A20130112.1200.002.grb, TIKA-1423.palsulich.120614.patch, 
 TIKA-1423.patch, fileName.html, gdas1.forecmwf.2014062612.grib2


 The Arctic dataset contains a MIME format called GRIB (General 
 Regularly-distributed Information in Binary form, 
 http://en.wikipedia.org/wiki/GRIB). GRIB is a well-known, concise data format 
 used in meteorology to store historical and forecast weather data. There are 
 2 different types of the format: GRIB 0 and GRIB 2. 
 The focus will be on GRIB 2, which is the most prevalent. Each GRIB record 
 intended for either transmission or storage contains a single parameter with 
 values located at an array of grid points, or represented as a set of 
 spectral coefficients, for a single level (or layer), encoded as a continuous 
 bit stream. Logical divisions of the record are designated as sections, 
 each of which provides control information and/or data. A GRIB record 
 consists of six sections, two of which are optional: 
  
 (0) Indicator Section 
 (1) Product Definition Section (PDS) 
 (2) Grid Description Section (GDS) - optional 
 (3) Bit Map Section (BMS) - optional 
 (4) Binary Data Section (BDS) 
 (5) '7777' (ASCII Characters)
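
As a rough illustration of how the first and last sections frame a record, 
the sketch below checks a byte array for the "GRIB" start marker and the 
"7777" end marker and reads the edition number. The markers and the edition 
octet (octet 8 of the indicator section) come from the GRIB specification, 
not from this ticket; the class and method names are hypothetical, and this 
is not the attached GribParser.java.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class GribFraming {
    private static final byte[] START = "GRIB".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] END = "7777".getBytes(StandardCharsets.US_ASCII);
    // Shortest framing considered here: 8-octet indicator plus 4-octet end.
    private static final int MIN_LENGTH = 12;

    // True when the message starts with "GRIB" and ends with "7777".
    static boolean looksLikeGrib(byte[] message) {
        if (message == null || message.length < MIN_LENGTH) {
            return false;
        }
        byte[] head = Arrays.copyOfRange(message, 0, START.length);
        byte[] tail = Arrays.copyOfRange(
                message, message.length - END.length, message.length);
        return Arrays.equals(head, START) && Arrays.equals(tail, END);
    }

    // The edition number is stored in octet 8 (index 7) of the indicator section.
    static int edition(byte[] message) {
        if (!looksLikeGrib(message)) {
            throw new IllegalArgumentException("not a GRIB message");
        }
        return message[7] & 0xFF;
    }

    public static void main(String[] args) {
        byte[] message = new byte[16];
        System.arraycopy(START, 0, message, 0, START.length);
        message[7] = 2; // pretend this is an edition 2 message
        System.arraycopy(END, 0, message, 12, END.length);
        System.out.println(looksLikeGrib(message) + " edition " + edition(message));
    }
}
```

A real parser would also read the section lengths and contents; this only 
covers the outer framing that a MIME magic check could rely on.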





[jira] [Comment Edited] (TIKA-1517) MIME type selection with probability

2015-01-28 Thread Luke sh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295928#comment-14295928
 ] 

Luke sh edited comment on TIKA-1517 at 1/28/15 11:06 PM:
-

The probability selection implementation will extend the class MimeTypes, 
which needs to be modified to expose some of its methods so that they can be 
inherited and overridden (TIKA-1535).


was (Author: lukeliush):
The probability selection will extend the class MimeTypes, which needs to be 
modified to expose some of its methods so that they can be inherited and 
overridden (TIKA-1535).


RE: [jira] [Commented] (TIKA-1535) Inheritance modification for the class MIMETypes

2015-01-28 Thread Luke
Hi Professor and all, 

A Bayesian or machine-learning Detector is different from the Bayesian 
selection mechanism reported in TIKA-1517.
It would make sense to implement a machine-learning algorithm in a separate 
Detector class. I have not gone far with that design yet, as I am still at the 
research stage, collecting data. Once I have enough data to form a model, and 
especially once I can prove the concept, I will be able to get to the 
implementation of the machine-learning detector and its design considerations. 
(BTW, I have some ideas for data collection and training, but it will still 
take some time to come up with something, even quick and dirty, that proves 
the concept. I am still working on the data collection, and there are also 
some design problems within the learning techniques; I will come to them once 
I have a clear picture of the data. I think I may have to crawl the data and 
label it for training, and some preprocessing steps will need care too.)

However, my current implementation in TIKA-1517 is based solely on MIME type 
selection with probability (I cannot find a clearer name that is 
distinguishable from detection), which might have nothing to do with a genuine 
machine-learning detector. It is a feature for adding weights to each Tika 
MIME type detection algorithm. 
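
As a rough illustration of adding weights to each detection method, the 
sketch below scores each candidate type as a prior times one factor per 
method and then normalizes. Only the 0.75 default for magic bytes and the 
convention of using 1 for a prior that is not applicable come from the 
TIKA-1517 description. The class and method names, the other trust values, 
and the use of 1 - trust for a method that disagrees with the candidate are 
assumptions made for this example; it is not the attached BaysianTest.java.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

final class BayesianSelection {

    // predictions[i] is the type reported by method i (assumed non-null);
    // trust[i] is P(method i reports t | the file really is t).
    // A type missing from priors gets a prior of 1, i.e. "not applicable".
    static Map<String, Double> posterior(String[] predictions, double[] trust,
                                         Map<String, Double> priors) {
        if (predictions.length == 0 || predictions.length != trust.length) {
            throw new IllegalArgumentException("need one trust value per prediction");
        }
        Map<String, Double> scores = new LinkedHashMap<>();
        for (String candidate : predictions) {
            if (scores.containsKey(candidate)) {
                continue; // already scored this type
            }
            double score = priors.getOrDefault(candidate, 1.0);
            for (int i = 0; i < predictions.length; i++) {
                // Simplifying assumption: a method that disagrees with the
                // candidate contributes 1 - trust[i].
                score *= candidate.equals(predictions[i]) ? trust[i] : 1.0 - trust[i];
            }
            scores.put(candidate, score);
        }
        double total = 0.0;
        for (double s : scores.values()) {
            total += s;
        }
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            e.setValue(total == 0.0 ? 0.0 : e.getValue() / total);
        }
        return scores;
    }

    // Returns the candidate with the highest posterior; ties keep the
    // earliest prediction.
    static String select(String[] predictions, double[] trust,
                         Map<String, Double> priors) {
        String best = null;
        double bestP = -1.0;
        for (Map.Entry<String, Double> e
                : posterior(predictions, trust, priors).entrySet()) {
            if (e.getValue() > bestP) {
                best = e.getKey();
                bestP = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Order of methods: magic bytes, file extension, metadata hint.
        String[] predictions = {"application/pdf", "text/plain", "text/plain"};
        Map<String, Double> noPriors = new HashMap<>();
        System.out.println(select(predictions, new double[] {0.75, 0.6, 0.6}, noPriors));
        System.out.println(select(predictions, new double[] {0.75, 0.8, 0.8}, noPriors));
    }
}
```

With trust values of 0.75, 0.6 and 0.6, the magic-bytes result wins even when 
the other two methods disagree with it (0.12 against 0.09 before 
normalization); raising the other two to 0.8 flips the choice (0.03 against 
0.16).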

But I think you are right: in the future we will need to assign weights to a 
pool of detection algorithms, including machine-learning techniques or 
content-based detection algorithms. Declaring MimeTypes final has its design 
purpose, and I don't think it is a good idea to lump detector code into 
MimeTypes. I will come back to this design or architecture problem once I have 
a clearer idea of the machine-learning model (not necessarily a Bayesian model 
for detection). 

BTW, off the top of my head, I would tend to separate the detector semantics 
from MimeTypes, as follows.
What do you think about creating, say, a TikaDetector class independent of 
MimeTypes, and removing the detector role from MimeTypes (i.e. dropping 
"implements Detector" from MimeTypes)?

I will continue to think about this design problem as we move along, and I 
will leave notes on the ticket for sure. It looks like an important or big 
change, so any suggestions are welcome and appreciated.


Thanks
Luke

-Original Message-
From: Christian Alan Mattmann [mailto:mattm...@usc.edu] 
Sent: Wednesday, January 28, 2015 6:30 PM
To: Luke; 'Mattmann, Chris A (3980)'
Cc: nsf-polar-usc-stude...@googlegroups.com
Subject: Re: [jira] [Commented] (TIKA-1535) Inheritance modification for the 
class MIMETypes

Hi Luke, thanks much. I think we should be having this discussion on the 
dev@tika.apache.org list too, but thanks also for CC’ing the Polar students 
list.

My feeling is that Tyler has a good point and that having a BayesianDetector 
makes a ton of sense. How about we try that as a start, and see where it goes?

Cheers,
Chris


Chris Mattmann, Ph.D.
Adjunct Associate Professor, Computer Science Department University of Southern 
California Los Angeles, CA 90089 USA
Email: mattm...@usc.edu
WWW: http://sunset.usc.edu/~mattmann/





-Original Message-
From: Luke hanson311...@gmail.com
Date: Wednesday, January 28, 2015 at 5:48 PM
To: Chris Mattmann chris.a.mattm...@jpl.nasa.gov
Cc: Chris Mattmann mattm...@usc.edu, NSF Polar CyberInfrastructure DR 
Students nsf-polar-usc-stude...@googlegroups.com
Subject: FW: [jira] [Commented] (TIKA-1535) Inheritance modification for the 
class MIMETypes

Hi Professor,

I was about to modify the code so that it works with inheritance and 
code reuse when Tyler came across the ticket and posted the suggestion 
below, which is quite enlightening.

Defining the class as final in this case seems to tell me that any input 
stream that gets passed to the class is attached to one fixed type of 
MimeTypes (I tend to think MimeTypes should be tied to one 
input stream), or it can be interpreted as the MimeTypes of an input stream.
If we inherit from it with my implementation, 
MimeTypesBaysianSelection, that will look weird as 
inheritance, since my Bayesian implementation is more like an operation 
attached to that input stream's MimeTypes.

It seems the MimeTypes class is not only used as a MIME type detector (it 
implements the Detector interface, though) but also has some other 
purposes, e.g. users can take a peek at the input stream's MIME types, 
extensions, magics, etc. That is probably why it is called MimeTypes 
rather than something like Detector. I think it is not a detector, but 
some of its methods, such as getMagics or something, make it easier to fit 
into the slot of Detectors, as it is easier to just outfit it with a 
Detector interface and 

Re: [jira] [Commented] (TIKA-1535) Inheritance modification for the class MIMETypes

2015-01-28 Thread Mattmann, Chris A (3980)
Hi Luke,


-Original Message-
From: Luke hanson311...@gmail.com
Date: Wednesday, January 28, 2015 at 7:15 PM
To: Chris Mattmann mattm...@usc.edu, Chris Mattmann
chris.a.mattm...@jpl.nasa.gov, dev@tika.apache.org
dev@tika.apache.org
Cc: NSF Polar CyberInfrastructure DR Students
nsf-polar-usc-stude...@googlegroups.com
Subject: RE: [jira] [Commented] (TIKA-1535) Inheritance modification for
the class MIMETypes

Hi Professor and all,

A Bayesian or machine learning Detector is different from the Bayesian
Selection mechanism reported in TIKA-1517.
It would make sense to implement a machine learning algorithm in a
separate Detector class. I have not gone too far with this design
thought, as I am still at the research stage with data collection.
Once I have enough data, am able to form a model, and especially am able
to prove my concept, I will be able to get down to the machine
learning detector implementation and its design considerations. (BTW, I think
I have some ideas about data collection and training, but it will still take
some time to come up with something, even quick and dirty, that can prove the
concept with machine learning. I am still working on the data collection,
and there are also some design problems within the learning techniques. I
will come to them once I have a clear idea of the data. I think I
may have to crawl the data and label it for training, and there are
certain preprocessing steps to take care of too.)

+1.


However, my current implementation in TIKA-1517 is solely based on MIME
type selection (I cannot find any clearer name distinguishable from
detection) with probabilities, which might have nothing to do with a
genuine machine learning detector. It is a feature for adding weights to
each Tika MIME type detection algorithm.

Gotcha.
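
The idea of attaching weights to each detection algorithm can be pictured with a small sketch. This is only an illustration and not the TIKA-1517 patch: the method names ("magic", "extension", "metadata"), the weight values, and the simple additive scoring are all assumptions made for this example, and the scoring is plain weighted voting rather than a Bayesian computation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class WeightedSelectionSketch {

    /**
     * Picks the MIME type with the highest total weight.
     * guesses: detection method name -> MIME type that method proposed.
     * weights: detection method name -> trust placed in that method (hypothetical values).
     */
    static String select(Map<String, String> guesses, Map<String, Double> weights) {
        Map<String, Double> scores = new LinkedHashMap<String, Double>();
        for (Map.Entry<String, String> g : guesses.entrySet()) {
            Double w = weights.get(g.getKey());
            double weight = (w == null) ? 0.0 : w;
            Double soFar = scores.get(g.getValue());
            scores.put(g.getValue(), (soFar == null ? 0.0 : soFar) + weight);
        }
        String best = "application/octet-stream"; // fallback when nothing scores
        double bestScore = 0.0;
        for (Map.Entry<String, Double> s : scores.entrySet()) {
            if (s.getValue() > bestScore) {
                best = s.getKey();
                bestScore = s.getValue();
            }
        }
        return best;
    }

    /** Three hypothetical methods disagree; the heavily weighted one wins. */
    static String demo() {
        Map<String, String> guesses = new LinkedHashMap<String, String>();
        guesses.put("magic", "application/pdf");
        guesses.put("extension", "text/plain");
        guesses.put("metadata", "text/plain");

        Map<String, Double> weights = new LinkedHashMap<String, Double>();
        weights.put("magic", 0.7);
        weights.put("extension", 0.2);
        weights.put("metadata", 0.1);
        return select(guesses, weights);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With these made-up weights the magic-based guess outweighs the two weaker methods combined, which is the kind of behaviour a weighting feature is meant to make tunable.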

 

But I think you are right, and in the future we kinda need it to assign
weights to a pool of detection algorithms, including machine learning
techniques or content-based detection algorithms. The current
implementation of MimeTypes as a final class has its design purpose, and I
don’t think it is a good idea to lump detector code into MimeTypes,
but I will come back to this design or architecture problem once I have
some clear ideas about the machine learning model (not necessarily a Bayesian
model for detection).
 

BTW, off the top of my head, I would tend to distill the detector
semantics out of MimeTypes, as mentioned below.
What do you think about creating, say, a TikaDetector class independent
of MimeTypes, and removing MimeTypes from the
detectors (i.e. dropping the implements Detector from
MimeTypes)?

Yes, can you explore doing this?
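
One way to picture the proposed separation, as a rough sketch only: the registry keeps the type data, and a standalone detector uses it by composition instead of the registry implementing the detector interface itself. All types below (SimpleDetector, TypeRegistry, RegistryBackedDetector) are hypothetical stand-ins written for this example. They are not Tika's MimeTypes or Detector, and the real interfaces take streams and metadata rather than a byte array.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class DetectorSplitDemo {

    /** Hypothetical detector contract: inspect leading bytes, return a MIME type name. */
    interface SimpleDetector {
        String detect(byte[] head);
    }

    /** Hypothetical registry: holds magic prefixes only and does no detecting itself. */
    static final class TypeRegistry {
        private final Map<String, String> magics = new LinkedHashMap<String, String>();

        void register(String magicPrefix, String mimeType) {
            magics.put(magicPrefix, mimeType);
        }

        Map<String, String> getMagics() {
            return magics;
        }
    }

    /** Detector built on top of the registry by composition rather than inheritance. */
    static final class RegistryBackedDetector implements SimpleDetector {
        private final TypeRegistry registry;

        RegistryBackedDetector(TypeRegistry registry) {
            this.registry = registry;
        }

        @Override
        public String detect(byte[] head) {
            String text = new String(head, StandardCharsets.ISO_8859_1);
            for (Map.Entry<String, String> m : registry.getMagics().entrySet()) {
                if (text.startsWith(m.getKey())) {
                    return m.getValue();
                }
            }
            return "application/octet-stream"; // nothing matched
        }
    }

    /** Builds a detector over a registry with two illustrative magic prefixes. */
    static SimpleDetector sample() {
        TypeRegistry registry = new TypeRegistry();
        registry.register("%PDF", "application/pdf");
        registry.register("PK", "application/zip");
        return new RegistryBackedDetector(registry);
    }

    public static void main(String[] args) {
        byte[] head = "%PDF-1.4".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(sample().detect(head));
    }
}
```

In this arrangement the registry can stay final, and new detectors (for example a weighted or learned one) can be added next to RegistryBackedDetector without touching it.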


I will continue to think about this design problem as we move along, and
I will leave notes on the ticket for sure. It looks like an important or
big change, so any suggestions are welcome and appreciated.

Thank you Luke, will do. I will read more and comment on it. Thanks for
sharing this with the list!

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






[jira] [Comment Edited] (TIKA-1518) Docker with Tika Server

2015-01-28 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14296439#comment-14296439
 ] 

Chris A. Mattmann edited comment on TIKA-1518 at 1/29/15 6:15 AM:
--

Thanks Tyler. Can you raise #2 on infrastruct...@apache.org? That would be an 
awesome idea, and then keep folks here posted. As for #1, +1 from me. RE: #3, 
there is a TIKA issue on that, I think it's 
https://issues.apache.org/jira/browse/TIKA-1302


was (Author: chrismattmann):
Thanks Tyler. Can you raise #2 on infrastruct...@apache.org? That would be an 
awesome idea, and then keep folks here posted. As for #1, +1 from me. RE: #3, 
there is a TIKA issue on that, I think it's TIKA-1312

 Docker with Tika Server
 ---

 Key: TIKA-1518
 URL: https://issues.apache.org/jira/browse/TIKA-1518
 Project: Tika
  Issue Type: New Feature
Reporter: Paul Ramirez
 Fix For: 1.8


 This version should be able to demonstrate as many of Apache Tika's 
 capabilities as possible. For instance with GDAL, Tesseract, and FFmpeg to 
 show parsers which require installation of other dependencies. In addition, 
 this should help move TIKA-1301 forward and should leverage the suggestion 
 made by [~lewismc] of a script which can pull down the latest version of 
 Apache Tika.





[jira] [Commented] (TIKA-1518) Docker with Tika Server

2015-01-28 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14296439#comment-14296439
 ] 

Chris A. Mattmann commented on TIKA-1518:
-

Thanks Tyler. Can you raise #2 on infrastruct...@apache.org? That would be an 
awesome idea, and then keep folks here posted. As for #1, +1 from me. RE: #3, 
there is a TIKA issue on that, I think it's TIKA-1312

 Docker with Tika Server
 ---

 Key: TIKA-1518
 URL: https://issues.apache.org/jira/browse/TIKA-1518
 Project: Tika
  Issue Type: New Feature
Reporter: Paul Ramirez
 Fix For: 1.8


 This version should be able to demonstrate as many of Apache Tika's 
 capabilities as possible. For instance with GDAL, Tesseract, and FFmpeg to 
 show parsers which require installation of other dependencies. In addition, 
 this should help move TIKA-1301 forward and should leverage the suggestion 
 made by [~lewismc] of a script which can pull down the latest version of 
 Apache Tika.





[jira] [Comment Edited] (TIKA-1518) Docker with Tika Server

2015-01-28 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14296439#comment-14296439
 ] 

Chris A. Mattmann edited comment on TIKA-1518 at 1/29/15 6:15 AM:
--

Thanks Tyler. Can you raise #2 on infrastruct...@apache.org? That would be an 
awesome idea, and then keep folks here posted. As for #1, +1 from me. RE: #3, 
there is a TIKA issue on that, I think it's TIKA-1302


was (Author: chrismattmann):
Thanks Tyler. Can you raise #2 on infrastruct...@apache.org? That would be an 
awesome idea, and then keep folks here posted. As for #1, +1 from me. RE: #3, 
there is a TIKA issue on that, I think it's 
https://issues.apache.org/jira/browse/TIKA-1302

 Docker with Tika Server
 ---

 Key: TIKA-1518
 URL: https://issues.apache.org/jira/browse/TIKA-1518
 Project: Tika
  Issue Type: New Feature
Reporter: Paul Ramirez
 Fix For: 1.8


 This version should be able to demonstrate as many of Apache Tika's 
 capabilities as possible. For instance with GDAL, Tesseract, and FFmpeg to 
 show parsers which require installation of other dependencies. In addition, 
 this should help move TIKA-1301 forward and should leverage the suggestion 
 made by [~lewismc] of a script which can pull down the latest version of 
 Apache Tika.





[jira] [Comment Edited] (TIKA-1423) Build a parser to extract data from GRIB formats

2015-01-28 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14296541#comment-14296541
 ] 

Lewis John McGibbney edited comment on TIKA-1423 at 1/29/15 7:54 AM:
-

Patch for trunk which passes all tests including issues experienced with bundle 
module. Some investigative work was required here as well as publishing 
[Unidata dependencies|http://search.maven.org/#search|ga|1|ucar] to Maven 
central and updating our [wiki 
documentation|https://wiki.apache.org/tika/ThirdPartySonaType]. 
Please insert the .grib file into
{code}
tika-parsers/src/test/resources/test-documents/gdas1.forecmwf.2014062612.grib2
{code}


was (Author: lewismc):
Patch for trunk which passes all tests including issues experienced with bundle 
module. Some investigative work was required here as well as publishing 
[Unidata dependencies|http://search.maven.org/#search|ga|1|ucar] to Maven 
central and updating our [https://wiki.apache.org/tika/ThirdPartySonaType|wiki 
documentation]. 

 Build a parser to extract data from GRIB formats
 

 Key: TIKA-1423
 URL: https://issues.apache.org/jira/browse/TIKA-1423
 Project: Tika
  Issue Type: New Feature
  Components: metadata, mime, parser
Affects Versions: 1.6
Reporter: Vineet Ghatge
Assignee: Vineet Ghatge
Priority: Critical
  Labels: features, newbie
 Fix For: 1.8

 Attachments: GRIBParsertest.java, GribParser.java, 
 NLDAS_FORA0125_H.A20130112.1200.002.grb, TIKA-1423.palsulich.120614.patch, 
 TIKA-1423.patch, TIKA-1423v2.patch, fileName.html, 
 gdas1.forecmwf.2014062612.grib2


 Arctic dataset contains a MIME format called GRIB - General 
 Regularly-distributed Information in Binary form 
 http://en.wikipedia.org/wiki/GRIB . GRIB is a well-known, concise data format 
 used in meteorology to store historical and weather data. There are 2 
 different types of the format - GRIB 0, GRIB 2.  
 The focus will be on GRIB 2 which is the most prevalent. Each GRIB record 
 intended for either transmission or storage contains a single parameter with 
 values located at an array of grid points, or represented as a set of 
 spectral coefficients, for a single level (or layer), encoded as a continuous 
 bit stream. Logical divisions of the record are designated as sections, 
 each of which provides control information and/or data. A GRIB record 
 consists of six sections, two of which are optional: 
  
 (0) Indicator Section 
 (1) Product Definition Section (PDS) 
 (2) Grid Description Section (GDS) - optional 
 (3) Bit Map Section (BMS) - optional 
 (4) Binary Data Section (BDS) 
 (5) '7777' (ASCII Characters)
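
 The first and last sections of this layout are what a detector or parser would typically check first. The sketch below is not the attached GribParser; it assumes, based on the published GRIB layouts as commonly described, that a message begins with the ASCII bytes "GRIB", carries its edition number in octet 8, and ends with the ASCII marker "7777". The class and method names are hypothetical, and the sample bytes are synthetic rather than a valid GRIB record.

```java
import java.nio.charset.StandardCharsets;

class GribIndicatorSketch {

    /** Returns the edition number from octet 8, or -1 if the bytes do not start with "GRIB". */
    static int edition(byte[] message) {
        if (message == null || message.length < 8) {
            return -1;
        }
        String magic = new String(message, 0, 4, StandardCharsets.US_ASCII);
        if (!"GRIB".equals(magic)) {
            return -1;
        }
        return message[7] & 0xFF; // octet 8 is index 7
    }

    /** True if the message ends with the ASCII end marker "7777". */
    static boolean hasEndMarker(byte[] message) {
        if (message == null || message.length < 4) {
            return false;
        }
        String tail = new String(message, message.length - 4, 4, StandardCharsets.US_ASCII);
        return "7777".equals(tail);
    }

    /** Builds a tiny synthetic byte array for the demo; not a valid GRIB record. */
    static byte[] synthetic(int edition) {
        byte[] b = new byte[20];
        System.arraycopy("GRIB".getBytes(StandardCharsets.US_ASCII), 0, b, 0, 4);
        b[7] = (byte) edition;
        System.arraycopy("7777".getBytes(StandardCharsets.US_ASCII), 0, b, b.length - 4, 4);
        return b;
    }

    public static void main(String[] args) {
        byte[] sample = synthetic(2);
        System.out.println("edition: " + edition(sample));
        System.out.println("end marker present: " + hasEndMarker(sample));
    }
}
```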





Re: multiple detect call - different results (tika 1.7)

2015-01-28 Thread Mattmann, Chris A (3980)
Dear Gabriele,

Thanks for your question. It should be sent to dev@tika.apache.org
(moving dev-ow...@tika.apache.org to BCC).

I’ll take a look tomorrow if someone else hasn’t answered yet.

Cheers,
Chris


++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: Gabriele Guidi gabriele.gu...@eng.it
Date: Wednesday, January 28, 2015 at 5:25 AM
To: dev-ow...@tika.apache.org dev-ow...@tika.apache.org
Subject: multiple detect call - different results (tika 1.7)



Hi,


I found some strange behavior. I have a p7m file and extract the file inside
the signed one. I then use Tika to discover its MIME type. The first
call gives me application/pdf (that's correct), BUT every subsequent call
to Tika's detect method on the
same InputStream gives me application/octet-stream. Why?
I cannot understand the behavior or find a solution.


Just a snippet of code:

InputStream inputsbust = content.getContentStream();

System.out.println(" 1 mime " + filepath + " : " + tika.detect(inputsbust));
System.out.println(" 2 mime " + filepath + " : " + tika.detect(inputsbust));
System.out.println(" 3 mime " + filepath + " : " + tika.detect(inputsbust));



Result:

 1 mime /home/gguidi/01_file.pdf : application/pdf
 2 mime /home/gguidi/01_file.pdf : application/octet-stream
 3 mime /home/gguidi/01_file.pdf : application/octet-stream
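
A possible explanation for this pattern, offered as a sketch and not a confirmed diagnosis: type detection reads the first bytes of the stream, and if the stream does not support mark/reset, those header bytes are gone by the time of the second call. The plain-Java example below reproduces the pattern without Tika. Here `sniff` is a hypothetical stand-in for a magic-byte detector and is not Tika's API, and `withoutMark` only simulates a stream that lacks mark support.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class RepeatableSniffDemo {

    /** Hypothetical magic-byte check: reads up to 4 bytes and looks for "%PDF". */
    static String sniff(InputStream in) {
        try {
            boolean canRewind = in.markSupported();
            if (canRewind) {
                in.mark(4); // remember the current position
            }
            byte[] head = new byte[4];
            int n = 0;
            while (n < head.length) {
                int r = in.read(head, n, head.length - n);
                if (r < 0) {
                    break;
                }
                n += r;
            }
            if (canRewind) {
                in.reset(); // rewind so the next caller sees the same bytes
            }
            String magic = new String(head, 0, n, StandardCharsets.US_ASCII);
            return "%PDF".equals(magic) ? "application/pdf" : "application/octet-stream";
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Wraps a stream so that it reports no mark/reset support. */
    static InputStream withoutMark(InputStream in) {
        return new FilterInputStream(in) {
            @Override
            public boolean markSupported() {
                return false;
            }
        };
    }

    /** A stream over sample bytes that cannot be rewound. */
    static InputStream rawStream() {
        byte[] data = "%PDF-1.4 sample".getBytes(StandardCharsets.US_ASCII);
        return withoutMark(new ByteArrayInputStream(data));
    }

    /** The same stream wrapped once so that mark/reset works. */
    static InputStream bufferedStream() {
        return new BufferedInputStream(rawStream());
    }

    /** Calls sniff twice on the same stream and returns both answers. */
    static String[] sniffTwice(InputStream in) {
        return new String[] { sniff(in), sniff(in) };
    }

    public static void main(String[] args) {
        String[] raw = sniffTwice(rawStream());
        System.out.println("raw:      " + raw[0] + " then " + raw[1]);
        String[] buffered = sniffTwice(bufferedStream());
        System.out.println("buffered: " + buffered[0] + " then " + buffered[1]);
    }
}
```

If this is indeed the cause, wrapping the content stream once (for example in a BufferedInputStream) before the first detect call, and reusing that wrapper for later calls, may give consistent results. This has not been verified against Tika 1.7.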








Thanks


-- 


Gabriele Guidi
Direzione Pubblica Amministrazione
gabriele.gu...@eng.it

Engineering Ingegneria Informatica spa
Via Marconi, 10 - 40122, Bologna
Tel. +39-051.0435135
www.eng.it



RE: [jira] [Commented] (TIKA-1535) Inheritance modification for the class MIMETypes

2015-01-28 Thread Luke
Thanks, Professor, for the prompt and kind response. I will keep you updated on 
the progress and findings.


[jira] [Updated] (TIKA-1423) Build a parser to extract data from GRIB formats

2015-01-28 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated TIKA-1423:
---
Attachment: TIKA-1423v2.patch

Patch for trunk which passes all tests including issues experienced with bundle 
module. Some investigative work was required here as well as publishing 
[Unidata dependencies|http://search.maven.org/#search|ga|1|ucar] to Maven 
central and updating our [https://wiki.apache.org/tika/ThirdPartySonaType|wiki 
documentation]. 

 Build a parser to extract data from GRIB formats
 

 Key: TIKA-1423
 URL: https://issues.apache.org/jira/browse/TIKA-1423
 Project: Tika
  Issue Type: New Feature
  Components: metadata, mime, parser
Affects Versions: 1.6
Reporter: Vineet Ghatge
Assignee: Vineet Ghatge
Priority: Critical
  Labels: features, newbie
 Fix For: 1.8

 Attachments: GRIBParsertest.java, GribParser.java, 
 NLDAS_FORA0125_H.A20130112.1200.002.grb, TIKA-1423.palsulich.120614.patch, 
 TIKA-1423.patch, TIKA-1423v2.patch, fileName.html, 
 gdas1.forecmwf.2014062612.grib2


 Arctic dataset contains a MIME format called GRIB - General 
 Regularly-distributed Information in Binary form 
 http://en.wikipedia.org/wiki/GRIB . GRIB is a well-known, concise data format 
 used in meteorology to store historical and weather data. There are 2 
 different types of the format - GRIB 0, GRIB 2.  
 The focus will be on GRIB 2 which is the most prevalent. Each GRIB record 
 intended for either transmission or storage contains a single parameter with 
 values located at an array of grid points, or represented as a set of 
 spectral coefficients, for a single level (or layer), encoded as a continuous 
 bit stream. Logical divisions of the record are designated as sections, 
 each of which provides control information and/or data. A GRIB record 
 consists of six sections, two of which are optional: 
  
 (0) Indicator Section 
 (1) Product Definition Section (PDS) 
 (2) Grid Description Section (GDS) - optional 
 (3) Bit Map Section (BMS) - optional 
 (4) Binary Data Section (BDS) 
 (5) '7777' (ASCII Characters)





[jira] [Created] (TIKA-1534) Upgrade to Commons Compress 1.9

2015-01-28 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1534:
-

 Summary: Upgrade to Commons Compress 1.9
 Key: TIKA-1534
 URL: https://issues.apache.org/jira/browse/TIKA-1534
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Priority: Minor








[jira] [Commented] (TIKA-1534) Upgrade to Commons Compress 1.9

2015-01-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14295749#comment-14295749
 ] 

Hudson commented on TIKA-1534:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #457 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/457/])
TIKA-1534: Upgrade to Commons Compress 1.9 (tallison: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1655433)
* /tika/trunk/CHANGES.txt
* /tika/trunk/tika-parsers/pom.xml


 Upgrade to Commons Compress 1.9
 ---

 Key: TIKA-1534
 URL: https://issues.apache.org/jira/browse/TIKA-1534
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Priority: Minor







[jira] [Reopened] (TIKA-1329) Add RecursiveParserWrapper aka Jukka's (and Nick's) RecursiveMetadataParser

2015-01-28 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison reopened TIKA-1329:
---

Wait, do I need to update the webpage, too?  Or is that done automatically from 
tika-examples?

 Add RecursiveParserWrapper aka Jukka's (and Nick's) RecursiveMetadataParser
 ---

 Key: TIKA-1329
 URL: https://issues.apache.org/jira/browse/TIKA-1329
 Project: Tika
  Issue Type: Sub-task
  Components: parser
Reporter: Tim Allison
Priority: Minor
 Fix For: 1.8

 Attachments: TIKA-1329v2.patch, test_recursive_embedded.docx


 Jukka and Nick have a great demo of parsing metadata recursively on the 
 [wiki|http://wiki.apache.org/tika/RecursiveMetadata].  For TIKA-1302, I'd 
 like to use something similar, and I think that others may find it useful for 
 tika-app and tika-server.
 I took the code from the wiki and made some modifications.  I'm not sure if 
 we should put this in parsers or in a new module for examples.  Given that 
 I think this would be useful for tika-app and tika-server, I'd prefer 
 parsers, but I'm open to any input...including let's not.
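
 The recursive idea described above can be sketched in plain Java. This is only an illustration under stated assumptions: Doc is a hypothetical stand-in for a parsed document with embedded children, the "path" and "type" keys are made up for the example, and nothing here is the wiki code or the API of the class being proposed. It shows the general shape of the result: one metadata map per document, container first, embedded documents following in depth-first order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RecursiveMetadataSketch {

    /** Hypothetical parsed document: a name, a type, and any embedded documents. */
    static final class Doc {
        final String name;
        final String type;
        final List<Doc> embedded;

        Doc(String name, String type, Doc... embedded) {
            this.name = name;
            this.type = type;
            this.embedded = Arrays.asList(embedded);
        }
    }

    /** Depth-first walk: one metadata map per document, container first. */
    static List<Map<String, String>> collect(Doc root) {
        List<Map<String, String>> out = new ArrayList<Map<String, String>>();
        walk(root, "", out);
        return out;
    }

    private static void walk(Doc doc, String parentPath, List<Map<String, String>> out) {
        String path = parentPath + "/" + doc.name;
        Map<String, String> md = new LinkedHashMap<String, String>();
        md.put("path", path);
        md.put("type", doc.type);
        out.add(md);
        for (Doc child : doc.embedded) {
            walk(child, path, out); // recurse into embedded documents
        }
    }

    /** A container with two attachments, one of which has its own embedded image. */
    static Doc sample() {
        return new Doc("report.docx", "docx",
                new Doc("chart.xlsx", "xlsx", new Doc("logo.png", "png")),
                new Doc("notes.txt", "txt"));
    }

    public static void main(String[] args) {
        for (Map<String, String> md : collect(sample())) {
            System.out.println(md.get("path") + " (" + md.get("type") + ")");
        }
    }
}
```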


