[jira] [Commented] (TIKA-1582) Content-based Mime Detection with Byte-frequency-histogram

2015-05-10 Thread Luke sh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14537405#comment-14537405
 ] 

Luke sh commented on TIKA-1582:
---

Thanks, professor [~chrismattmann].
Hi all,
I just created a draft wiki page for this feature; please kindly review it 
and let me know if anything is unclear.

https://wiki.apache.org/tika/ContentMimeDetection

Thanks


 Content-based Mime Detection with Byte-frequency-histogram 
 ---

 Key: TIKA-1582
 URL: https://issues.apache.org/jira/browse/TIKA-1582
 Project: Tika
  Issue Type: Improvement
  Components: detector, mime
Affects Versions: 1.7
Reporter: Luke sh
Assignee: Chris A. Mattmann
Priority: Trivial
  Labels: memex
 Fix For: 1.9

 Attachments: nnmodel.docx, week2-report-histogram comparison.docx, 
 week6 report.docx


 Content-based MIME type detection is one of the popular approaches to 
 detecting MIME types; others are based on file extensions and magic numbers. 
 Tika currently implements three approaches to detecting MIME types:
 1) file extensions
 2) magic numbers (the most trustworthy in Tika)
 3) content-type (the header in the HTTP response, if present and available)
 Content-based detection, by contrast, analyses the distribution of the entire 
 byte stream, finds a common pattern across files of the same type, and builds 
 a function that groups them into one or more classes for classification and 
 prediction. This feature might broaden the usage of Tika and add a measure of 
 security to MIME detection: because the model is trained on the patterns it 
 has seen, in some situations we may choose not to trust types it has never 
 been trained on. Magic numbers embedded in a file can be copied while the 
 actual content is a harmful Trojan program; by placing trust in 
 byte-frequency patterns instead, we can harden the detection.
 The proposed content-based MIME detection to be integrated into Tika is based 
 on a machine learning algorithm: a neural network trained with 
 back-propagation. The input is 256 bins (0-255), one per byte value, each 
 storing the count of occurrences of that byte; the byte-frequency histograms 
 are normalized to fall in the range between 0 and 1 and are then passed 
 through a companding function to boost the infrequent bytes.
 The output of the neural network is a binary decision, 1 or 0.
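As a minimal sketch of the input features described above (an illustration, not the code used in this research), the 256-bin histogram, normalization, and companding could look like this in Java; the square-root companding curve is an assumption, since the issue does not specify the exact function:

```java
// Build the 256-bin byte-frequency histogram, normalize it to [0, 1],
// and compand so that infrequent bytes are boosted. The exponent 0.5
// (square-root companding) is an assumption for illustration.
public class ByteHistogram {
    public static double[] features(byte[] data) {
        long[] counts = new long[256];
        for (byte b : data) {
            counts[b & 0xFF]++;
        }
        long max = 1;
        for (long c : counts) {
            max = Math.max(max, c);
        }
        double[] features = new double[256];
        for (int i = 0; i < 256; i++) {
            // normalize by the most frequent byte, then compand
            features[i] = Math.pow((double) counts[i] / max, 0.5);
        }
        return features;
    }
}
```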
 Note that the proposed feature will be implemented with the GRB file type as 
 one example. In this example, we build a model that separates GRB files from 
 non-GRB files; notice that the set of non-GRB files is huge and cannot be 
 easily defined, so we need as many negative training examples as possible to 
 form the non-GRB decision boundary.
 The neural network involves two stages: training and classification.
 Training can be done in any programming language; in this feature/research, 
 the training of the neural network is implemented in R, and the source can be 
 found in my GitHub repository, https://github.com/LukeLiush/filetypeDetection. 
 I am also going to post a document that describes the use of the program and 
 the syntax/format of its input and output.
 After training, we export the model and import it into Tika: we create a 
 TrainedModelDetector that reads one or more model files with their model 
 parameters, so it can detect the MIME types covered by those models. Details 
 of the research and usage of this proposed feature will be posted on my 
 GitHub shortly.
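The classification stage amounts to a feed-forward pass over those companded histogram features. The following is an illustrative sketch only; the single hidden layer, the weight layout, and the 0.5 threshold are assumptions, not the model actually trained in R:

```java
// Feed-forward pass for a one-hidden-layer network with a sigmoid
// output, thresholded to the binary GRB / non-GRB decision described
// above. Layer sizes, weights, and threshold are illustrative.
public class NNForward {
    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    // weightsHidden: [hidden][inputs + 1] (last column is the bias)
    // weightsOut:    [hidden + 1]         (last entry is the bias)
    public static boolean classify(double[] input, double[][] weightsHidden,
                                   double[] weightsOut) {
        int hidden = weightsHidden.length;
        double out = weightsOut[hidden]; // output bias
        for (int j = 0; j < hidden; j++) {
            double sum = weightsHidden[j][input.length]; // hidden bias
            for (int i = 0; i < input.length; i++) {
                sum += weightsHidden[j][i] * input[i];
            }
            out += weightsOut[j] * sigmoid(sum);
        }
        return sigmoid(out) > 0.5; // true = matches the trained type
    }
}
```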
 It is worth noting again that in this research we worked out only one model, 
 GRB, as an example to demonstrate the use of this content-based MIME 
 detection. One of the challenges, again, is that the non-GRB file types 
 cannot be clearly defined unless we feed the model example data for every 
 existing file type in the world, which seems utopian and rather unlikely; it 
 is better for the set of classes/types to be given and defined in advance, to 
 narrow the problem domain.
 Another challenge is the size of the training data: even when we know the 
 types we want to classify, getting enough training data to form a model is 
 one of the main factors of success. In our example model, GRB data were 
 collected from ftp://hydro1.sci.gsfc.nasa.gov/data/, and we found that the 
 GRB files from that source all exhibit a similar pattern, so a simple neural 
 network structure predicts well; even a linear logistic regression does a 
 good job.

[jira] [Updated] (TIKA-1582) Content-based Mime Detection with Byte-frequency-histogram

2015-05-10 Thread Luke sh (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke sh updated TIKA-1582:
--
Summary: Content-based Mime Detection with Byte-frequency-histogram   (was: 
Mime Detection based on neural networks with Byte-frequency-histogram )


[jira] [Commented] (TIKA-1582) Mime Detection based on neural networks with Byte-frequency-histogram

2015-05-04 Thread Luke sh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14526717#comment-14526717
 ] 

Luke sh commented on TIKA-1582:
---

Thanks a lot [~talli...@apache.org] for the comments; here are my thoughts.

[Tim]: I'm not sure how to use it, or even how to know when I'd want to.
[Luke]: The idea behind this feature is to give Tika users the option of 
applying content-based MIME detection. The algorithm itself does not matter 
much (neural network, SVM, Bayesian, etc.); there is no single best machine 
learning algorithm for every data problem. For example, users can classify 
their file types with a simple linear classification technique as long as it 
meets their goals, though this requires some empirical analysis of the 
candidate learning algorithms.
Nevertheless, in my opinion what matters most may be the data used for 
classification. The patterns, or the knowledge that comes from the data, may 
be specialized to one domain, and understanding them requires some domain 
knowledge, which may be the key to developing a high-accuracy learning 
system. Alternatively, I might ask myself: can a human expert classify the 
file types by looking at the input X (e.g. histogram, actual bytes)? If we 
consider every existing type in the world, I don't think a human could learn 
that accurately; but if we consider one, two, or several well-defined types, 
the detection accuracy could be much higher. When users want more security or 
assurance for detecting some particular file types, they can define or 
develop their own learning algorithm (SVM, Bayesian, neural net, whatever 
they want) to further undergird the detection, as long as they have trained a 
good model. From this perspective, users might also need some knowledge of 
the machine learning algorithm they want to use.

I have not taken a closer look at the links, e.g. 
http://www.dfrws.org/2012/proceedings/DFRWS2012-5.pdf, but I guess the tests 
are based on some set of file types, and the accuracy may not be 100% for 
each type in the tests. In essence, machine learning algorithms are good at 
estimation from existing knowledge; if we apply our existing knowledge to the 
detection, we can probably enhance its security.

If anything causes confusion, please kindly let me know; any comments are 
welcome and appreciated.

Thanks


[jira] [Commented] (TIKA-1582) Mime Detection based on neural networks with Byte-frequency-histogram

2015-05-02 Thread Luke sh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14525379#comment-14525379
 ] 

Luke sh commented on TIKA-1582:
---

Sure [~chrismattmann], I will work on the wiki for this content detection 
feature, TIKA-1582.

Thanks [~gagravarr] for the comments, but 
https://wiki.apache.org/tika/BaysianMimeTypeSelector is a separate feature 
that corresponds to TIKA-1517.

Thanks




[jira] [Commented] (TIKA-1517) MIME type selection with probability

2015-04-25 Thread Luke sh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512284#comment-14512284
 ] 

Luke sh commented on TIKA-1517:
---

Notes:
A pull request adds support for the Tika() facade.
If a user wants to use this feature, the following code is needed.

Tika tika = new Tika(new TikaConfig() {
    @Override
    protected Detector getDefaultDetector(MimeTypes types, ServiceLoader loader) {
        /*
         * Here is an example using the builder to instantiate the object.
         */
        Builder builder = new ProbabilisticMimeDetectionSelector.Builder();
        ProbabilisticMimeDetectionSelector proDetector =
                new ProbabilisticMimeDetectionSelector(types,
                        builder.priorMagicFileType(0.5f)
                               .priorExtensionFileType(0.5f)
                               .priorMetaFileType(0.5f));
        return new DefaultProbDetector(proDetector, loader);
    }
});
The idea is simple: we override getDefaultDetector() to return a 
DefaultProbDetector, which extends CompositeDetector. A CompositeDetector is 
a Detector that takes a list of detectors; when its detect() method is 
called, each detector in the list is invoked sequentially, one after another. 
The original implementation of getDefaultDetector() in TikaConfig returns an 
instance of DefaultDetector, which also extends CompositeDetector and whose 
detector list includes MimeTypes, the native implementation of the three 
detection methods (magic bytes, extension, and metadata hint). 
DefaultProbDetector replaces this MimeTypes with a 
ProbabilisticMimeDetectionSelector.
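To illustrate the sequential dispatch described above, here is a toy sketch of the composite idea; MiniDetector and the first-specific-answer rule are deliberate simplifications for illustration, not Tika's actual Detector/CompositeDetector source:

```java
import java.util.List;

// Toy model of a composite detector: children are consulted in order,
// and the first answer more specific than the fallback type wins.
public class MiniComposite {
    interface MiniDetector { String detect(byte[] prefix); } // hypothetical

    static final String FALLBACK = "application/octet-stream";

    public static String detect(List<MiniDetector> detectors, byte[] prefix) {
        for (MiniDetector d : detectors) {
            String type = d.detect(prefix);
            if (type != null && !type.equals(FALLBACK)) {
                return type; // first specific answer wins
            }
        }
        return FALLBACK;
    }
}
```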

To set the preferential weights, an instance of 
ProbabilisticMimeDetectionSelector can be created as in the example snippet 
above.

Alternatively, if we are fine with the default settings of 
ProbabilisticMimeDetectionSelector, we can skip the arguments and simply call 
"return new DefaultProbDetector();".
Alternatively, if we don't want to write extra code, the following can also 
be used.
/*
 * An XML file tells Tika where the detector is located; users can build
 * their own detector configuration by including or excluding this feature,
 * or any detectors in the composite list, at will.
 */
Tika tika = new Tika(new TikaConfig(new File("TIKA-detector-sample.xml")));


TIKA-detector-sample.xml:
<?xml version="1.0" encoding="UTF-8"?>
<!--
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
-->
<properties>
  <detectors>
    <detector class="org.apache.tika.detect.DefaultProbDetector"/>
  </detectors>
</properties>





[jira] [Commented] (TIKA-1517) MIME type selection with probability

2015-04-25 Thread Luke sh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512286#comment-14512286
 ] 

Luke sh commented on TIKA-1517:
---

Notes:

This feature was also tested with the data from gen-common-crawl.sh 
(https://github.com/LukeLiush/trec-dd-polar).
With the default settings, the feature behaves as expected on that data in 
the test.

The following is copied from the email update for the test.

Both runs (Tika with the prob feature and the one without it) produced the 
same stats total; please see the attached matched.txt, dumped by the small 
program that checks and compares, line by line, each section of the stats 
total between the log produced by the Tika build with the feature and the one 
without it. If string.equals(...) holds for a line, that line is dumped out; 
on a mismatch (e.g. a different count for a particular MIME type), an error 
is dumped instead. In the end I don't see any error in the printout, so the 
feature seems to have passed the test.
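A hypothetical reconstruction of that small comparison program (the real one is not shown in this thread, so method and variable names here are assumptions) could look like:

```java
import java.util.List;

// Walk two "stats total" logs line by line: print matching lines,
// report an error on any mismatch, and count the mismatches.
public class CompareStats {
    public static int compare(List<String> withFeature, List<String> withoutFeature) {
        int mismatches = 0;
        int n = Math.min(withFeature.size(), withoutFeature.size());
        for (int i = 0; i < n; i++) {
            if (withFeature.get(i).equals(withoutFeature.get(i))) {
                System.out.println(withFeature.get(i)); // matched line
            } else {
                System.err.println("error: mismatch at line " + i);
                mismatches++;
            }
        }
        // trailing unmatched lines also count as mismatches
        return mismatches + Math.abs(withFeature.size() - withoutFeature.size());
    }
}
```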
 
The processing times of the two tests are as follows.
The following shows the start and end times for the test with the Nutch 
dumper tool integrated with the prob selection feature.
from
2015-04-22 15:47:08,330
to
2015-04-22 17:48:28,877
 
The following shows the start and end times for the test without the prob 
selection feature.
from
2015-04-22 22:41:23,459
to
2015-04-23 00:11:02,767

 MIME type selection with probability
 

 Key: TIKA-1517
 URL: https://issues.apache.org/jira/browse/TIKA-1517
 Project: Tika
  Issue Type: Improvement
  Components: mime
Affects Versions: 0.1-incubating, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 
 0.10, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6
Reporter: Luke sh
Priority: Trivial
 Attachments: BaysianTest.java


 Improvement and intuition
 The original implementation of MIME type selection/detection is somewhat 
 inflexible by design: it relies heavily on the outcome of magic-bytes MIME 
 type identification, so if magic bytes are applicable to a file, Tika 
 follows the type they indicate. It may be better to provide more control 
 over the method of choice.
 The proposed approach incorporates the Bayesian probability theorem: users 
 assign weights, in terms of probability, to each identification method, 
 giving them control over their preference among the file/MIME type 
 identification methods available in Tika. Currently there are three (magic 
 bytes, file extension, and the metadata content-type hint). By introducing 
 weights on the methods, users can choose the one they trust most; the 
 magic-bytes method is often trustworthy. But the virtue is that in some 
 situations file type identification must be strict: some may want all of the 
 MIME type identification methods to agree on the same file type before they 
 start processing those files, because incorrect identification is 
 intolerable. The current implementation seems too inflexible for this 
 purpose and relies heavily on the magic-bytes identification method 
 (although magic bytes are the most reliable of the three).
 Proposed design:
 The idea is to attach probabilities as weights to each MIME type 
 identification method currently implemented in Tika (magic bytes, 
 file-extension match, and metadata content-type hint). For example, as a 
 user I would assign preference to each method based on my degree of trust, 
 and order the results if they don't coincide. Bayes' rule is appropriate 
 here and matches this intuition. The following is needed for a Bayesian 
 implementation:
  Prior probability P(file_type), e.g. P(pdf). Theoretically this is computed 
  from samples and depends on the domain or use case; intuitively we care 
  more about the ordering of the weights or probabilities of the results than 
  about the actual numbers. The prior also depends on the samples for a 
  particular use case or domain: e.g. if we happen to crawl a website that 
  contains mostly PDF files, we can collect samples and compute the prior; if 
  90% of the documents are PDF, the prior is defined as P(pdf) = 0.9. Here we 
  propose to make the prior a configurable parameter and to leave it 
  unapplied by default. Alternatively, we could define the prior for each 
  file type as 1/[number of file types supported by Tika], approximately 
  1/1157, but using this number seems to be
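As a toy illustration of the weighting idea above (not the actual ProbabilisticMimeDetectionSelector internals; method names and weights here are assumptions), the per-method trust weights can be combined and renormalized over the candidate types:

```java
import java.util.HashMap;
import java.util.Map;

// Combine each identification method's vote, weighted by the user's
// trust in that method, then renormalize over the candidate types.
public class BayesCombine {
    // methodTrust: trust weight per method; votes: type predicted per method
    public static Map<String, Double> posterior(Map<String, Double> methodTrust,
                                                Map<String, String> votes) {
        Map<String, Double> score = new HashMap<>();
        for (Map.Entry<String, String> e : votes.entrySet()) {
            double w = methodTrust.getOrDefault(e.getKey(), 0.0);
            score.merge(e.getValue(), w, Double::sum);
        }
        double total = score.values().stream().mapToDouble(Double::doubleValue).sum();
        if (total > 0) {
            score.replaceAll((type, s) -> s / total);
        }
        return score;
    }
}
```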

[jira] [Commented] (TIKA-1610) CBOR Parser and detection [improvement]

2015-04-23 Thread Luke sh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510402#comment-14510402
 ] 

Luke sh commented on TIKA-1610:
---

Thanks a lot [~gagravarr] for the prompt response.
I thought it would probably be risky to discard either of the estimated types 
because of the magic priority (one being higher than the other); I wanted 
Tika to rely on the extension when there is a tie to break.

For now, in this particular case, I also cannot think of any reason why we 
shouldn't use 60; maybe I am too skeptical.
Thanks


 CBOR Parser and detection [improvement]
 ---

 Key: TIKA-1610
 URL: https://issues.apache.org/jira/browse/TIKA-1610
 Project: Tika
  Issue Type: New Feature
  Components: detector, mime, parser
Affects Versions: 1.7
Reporter: Luke sh
Assignee: Chris A. Mattmann
Priority: Trivial
  Labels: memex
 Attachments: 142440269.html, NUTCH-1997.cbor, 
 cbor_tika.mimetypes.xml.jpg, rfc_cbor.jpg


 CBOR is a data format whose design goals include the possibility of 
 extremely small code size, fairly small message size, and extensibility 
 without the need for version negotiation (cited from http://cbor.io/).
 It would be great if Tika could provide CBOR parsing and identification. In 
 the current project with Nutch, the Nutch CommonCrawlDataDumper is used to 
 dump crawled segments to files in CBOR format. To read/parse those dumped 
 files, it would be great if Tika supported parsing CBOR. The problem is that 
 CommonCrawlDataDumper does not dump with the correct extension; it follows 
 its own rule, and the default extension of a dumped file is .html, so it 
 would be less painful if Tika could detect and parse those files without any 
 pre-processing steps.
 CommonCrawlDataDumper calls the following to dump CBOR:
 import com.fasterxml.jackson.dataformat.cbor.CBORFactory;
 import com.fasterxml.jackson.dataformat.cbor.CBORGenerator;
 FasterXML is a third-party library for converting JSON to .cbor and vice 
 versa.
 According to RFC 7049 (http://tools.ietf.org/html/rfc7049), it looks like 
 CBOR does not yet have magic numbers by which other applications can detect 
 or identify it (PFA: rfc_cbor.jpg).
 It seems the only ways to inform other applications of the type, as of now, 
 are the extension (i.e. .cbor) or content detection (e.g. byte-histogram 
 distribution estimation).
 Another thing worth attention: Tika has already attempted to support cbor 
 mime detection in tika-mimetypes.xml (PFA: cbor_tika.mimetypes.xml.jpg), but 
 this detection does not work with the cbor files dumped by 
 CommonCrawlDataDumper. 
 According to http://tools.ietf.org/html/rfc7049#section-2.4.5, there is a 
 self-describing tag 55799 (serialized as the bytes 0xd9d9f7) that can be used 
 for cbor type identification, but it is up to the encoding application to 
 emit this tag, and it appears that the fasterxml library used by the Nutch 
 dumping tool does not write it. An example cbor file dumped by the Nutch 
 tool, i.e. CommonCrawlDataDumper, has also been attached (PFA: 142440269.html).
 The following is cited from the RFC: "...a decoder might be able to 
 parse both CBOR and JSON.
Such a decoder would need to mechanically distinguish the two
formats.  An easy way for an encoder to help the decoder would be to
tag the entire CBOR item with tag 55799, the serialization of which
will never be found at the beginning of a JSON text..."
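 A minimal sketch of what such a mechanical check could look like (assuming 
 the self-describing tag, if present, sits at the very start of the stream; 
 this is not an existing Tika API, just an illustration):

```java
public class CborPrefixCheck {
    // Tag 55799 serializes as the three bytes 0xD9 0xD9 0xF7 (RFC 7049, 2.4.5).
    private static final byte[] SELF_DESCRIBE_TAG = { (byte) 0xD9, (byte) 0xD9, (byte) 0xF7 };

    // Returns true if the buffer starts with the CBOR self-describing tag.
    static boolean startsWithCborTag(byte[] buf) {
        if (buf.length < SELF_DESCRIBE_TAG.length) return false;
        for (int i = 0; i < SELF_DESCRIBE_TAG.length; i++) {
            if (buf[i] != SELF_DESCRIBE_TAG[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] tagged = { (byte) 0xD9, (byte) 0xD9, (byte) 0xF7, 0x0A };
        byte[] json = "{\"a\":1}".getBytes();
        System.out.println(startsWithCborTag(tagged)); // true
        System.out.println(startsWithCborTag(json));   // false
    }
}
```

 Since the files dumped by CommonCrawlDataDumper appear to lack this tag, a 
 prefix check alone would not identify them.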
 It looks like a dumped file can contain two parts/sections, i.e. a plain-text 
 part and the CBOR-encoded JSON; this is also worth attention and 
 consideration when parsing and identifying the type.
 On the other hand, it is worth noting that an entry for cbor extension 
 detection needs to be appended to tika-mimetypes.xml too, 
 e.g.
 <glob pattern="*.cbor"/>
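 A sketch of what a fuller entry in tika-mimetypes.xml could look like, 
 combining the glob with a magic match on the self-describing tag (the 
 priority value and the exact match syntax here are illustrative assumptions, 
 not the entry Tika currently ships):

```xml
<mime-type type="application/cbor">
  <!-- self-describing tag 55799 serializes as 0xD9 0xD9 0xF7 (RFC 7049, 2.4.5);
       note that files dumped by CommonCrawlDataDumper lack this prefix -->
  <magic priority="50">
    <match value="\xd9\xd9\xf7" type="string" offset="0"/>
  </magic>
  <glob pattern="*.cbor"/>
</mime-type>
```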



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-1610) CBOR Parser and detection [improvement]

2015-04-23 Thread Luke sh (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke sh updated TIKA-1610:
--
Attachment: NUTCH-1997.cbor



[jira] [Comment Edited] (TIKA-1610) CBOR Parser and detection [improvement]

2015-04-23 Thread Luke sh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14510382#comment-14510382
 ] 

Luke sh edited comment on TIKA-1610 at 4/24/15 2:43 AM:


Notes:
The attached cbor file (i.e. NUTCH-1997.cbor) contains magic bytes for both 
the xhtml type and the cbor type. With priority 40 on application/cbor, we 
have the following issues.

Problem 1: magic priority 40.
application/xhtml+xml has a higher priority (50) than application/cbor (40); 
[I don't know who (and why) assigned 40 to cbor.] If xhtml is read and 
compared first, cbor will not even be placed in the magic-estimation list 
because of its lower priority. Tests confirm that xhtml is indeed read and 
compared against the input file first, so any type below priority 50 is 
disregarded.

Problem 2: again, magic priority, with 50.
In Tika, given a file dumped by the Nutch dumper tool, both types (xhtml and 
cbor) would be selected as candidate mime types and put in the 
magic-estimation list; since the xhtml type is read first, it is placed above 
cbor. To break that tie, Tika relies on the decision from the extension 
method. If the extension method fails to detect the type (for now, let's 
ignore the metadata-hint method for simplicity, but the same applies to it 
too), then xhtml is eventually returned.

My pull request to be sent: I am going to set the magic priority of the cbor 
type to 50, the same as xhtml, because it would probably be risky to discard 
any of the estimated types without consulting the extension method.
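Sketched as a change to the cbor entry in tika-mimetypes.xml (the match value 
shown is an illustrative assumption; the point is only the priority attribute):

```xml
<!-- before: <magic priority="40"> — cbor was dropped whenever xhtml (50) matched first -->
<magic priority="50">
  <match value="\xd9\xd9\xf7" type="string" offset="0"/>
</magic>
```

With equal priorities, both candidates survive into the magic-estimation list 
and the extension method gets a chance to break the tie.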






[jira] [Commented] (TIKA-1610) CBOR Parser and detection [improvement]

2015-04-23 Thread Luke sh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14510382#comment-14510382
 ] 

Luke sh commented on TIKA-1610:
---


[jira] [Updated] (TIKA-1610) CBOR Parser and detection improvement

2015-04-21 Thread Luke sh (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke sh updated TIKA-1610:
--
Attachment: cbor_tika.mimetypes.xml.jpg
rfc_cbor.jpg



[jira] [Updated] (TIKA-1610) CBOR Parser and detection improvement

2015-04-21 Thread Luke sh (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke sh updated TIKA-1610:
--
Description: 

[jira] [Created] (TIKA-1610) CBOR Parser and detection improvement

2015-04-21 Thread Luke sh (JIRA)
Luke sh created TIKA-1610:
-

 Summary: CBOR Parser and detection improvement
 Key: TIKA-1610
 URL: https://issues.apache.org/jira/browse/TIKA-1610
 Project: Tika
  Issue Type: New Feature
  Components: detector, mime, parser
Affects Versions: 1.7
Reporter: Luke sh
Priority: Trivial




[jira] [Updated] (TIKA-1610) CBOR Parser and detection improvement

2015-04-21 Thread Luke sh (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke sh updated TIKA-1610:
--
Description: 
[jira] [Updated] (TIKA-1610) CBOR Parser and detection improvement

2015-04-21 Thread Luke sh (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke sh updated TIKA-1610:
--
Description: 
CBOR is a data format whose design goals include the possibility of extremely 
small code size, fairly small message size, and extensibility without the need 
for version negotiation (cited from http://cbor.io/ ).

It would be great if Tika is able to provide the support with CBOR parser and 
identification. In the current project with Nutch, the Nutch 
CommonCrawlDataDumper is used to dump the crawled segments to the files in the 
format of CBOR. In order to read/parse those dumped files by this tool, it 
would be great if tika is able to support parsing the cbor, the thing is that 
the CommonCrawlDataDumper is not dumping with correct extension, it dumps with 
its own rule, the default extension of the dumped file is html, so it might be 
less painful if tika is able to detect and parse those files without any 
pre-processing steps. 

CommonCrawlDataDumper is calling the following to dump with cbor.
import com.fasterxml.jackson.dataformat.cbor.CBORFactory;
import com.fasterxml.jackson.dataformat.cbor.CBORGenerator;

fasterxml is a 3rd party library for converting json to .cbor and Vice Versa.

According to RFC 7049 (http://tools.ietf.org/html/rfc7049), CBOR does not have 
a mandatory magic number by which other applications can detect/identify it 
(PFA: rfc_cbor.jpg).
It seems that, for now, the only ways to inform other applications of the type 
are the file extension (i.e. .cbor) or content-based detection (e.g. 
byte-histogram distribution estimation).

Another point worth attention: it looks like Tika has already attempted to add 
CBOR MIME detection in tika-mimetypes.xml (PFA: cbor_tika.mimetypes.xml.jpg); 
however, this detection does not work with the CBOR files dumped by 
CommonCrawlDataDumper.
According to http://tools.ietf.org/html/rfc7049, there is a self-describing 
tag, 55799, that can be used for CBOR type identification, but it is up to the 
application to emit this tag, and it appears that the FasterXML encoder used 
here omits it. An example CBOR file dumped by the Nutch tool, 
CommonCrawlDataDumper, is attached (PFA: 142440269.html).
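Per RFC 7049 (section 2.4.5), the self-describe tag 55799 serializes to the three bytes 0xD9 0xD9 0xF7, so a magic-byte style check is straightforward. The sketch below is illustrative, not Tika's actual detector; note that encoders which omit the tag (apparently including this Nutch dump path) will fail the check even though their output is valid CBOR.

```java
// Illustrative check for CBOR's self-describe tag 55799 (RFC 7049, sec. 2.4.5),
// which serializes to 0xD9 0xD9 0xF7. Not Tika's actual detector code.
public class CborTagDetector {

    static boolean hasSelfDescribeTag(byte[] head) {
        return head.length >= 3
                && (head[0] & 0xFF) == 0xD9   // major type 6 (tag), 2-byte value follows
                && (head[1] & 0xFF) == 0xD9   // high byte of 55799
                && (head[2] & 0xFF) == 0xF7;  // low byte of 55799
    }
}
```

The RFC notes that this three-byte prefix can never begin a JSON text, which is exactly what makes it usable as a magic number when the encoder cooperates.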

On the other hand, it is worth noting that an entry for CBOR extension 
detection needs to be added to tika-mimetypes.xml as well:
<glob pattern="*.cbor"/>







 CBOR Parser and detection improvement
 -

 Key: TIKA-1610
 URL: https://issues.apache.org/jira/browse/TIKA-1610
 Project: Tika
  Issue Type: New Feature
  Components: 

[jira] [Updated] (TIKA-1610) CBOR Parser and detection improvement

2015-04-21 Thread Luke sh (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke sh updated TIKA-1610:
--

[jira] [Updated] (TIKA-1610) CBOR Parser and detection improvement

2015-04-21 Thread Luke sh (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke sh updated TIKA-1610:
--
Attachment: 142440269.html

CBOR file dumped by the Nutch tool.

 CBOR Parser and detection improvement
 -

 Key: TIKA-1610
 URL: https://issues.apache.org/jira/browse/TIKA-1610
 Project: Tika
  Issue Type: New Feature
  Components: detector, mime, parser
Affects Versions: 1.7
Reporter: Luke sh
Priority: Trivial
  Labels: memex
 Attachments: 142440269.html, cbor_tika.mimetypes.xml.jpg, 
 rfc_cbor.jpg





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-1610) CBOR Parser and detection [improvement]

2015-04-21 Thread Luke sh (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke sh updated TIKA-1610:
--
Summary: CBOR Parser and detection [improvement]  (was: CBOR Parser and 
detection improvement)

 CBOR Parser and detection [improvement]
 ---

 Key: TIKA-1610
 URL: https://issues.apache.org/jira/browse/TIKA-1610
 Project: Tika
  Issue Type: New Feature
  Components: detector, mime, parser
Affects Versions: 1.7
Reporter: Luke sh
Priority: Trivial
  Labels: memex
 Attachments: 142440269.html, cbor_tika.mimetypes.xml.jpg, 
 rfc_cbor.jpg


 CBOR is a data format whose design goals include the possibility of extremely 
 small code size, fairly small message size, and extensibility without the 
 need for version negotiation (cited from http://cbor.io/).
 It would be great if Tika could provide CBOR parsing and identification 
 support. In the current project with Nutch, the Nutch CommonCrawlDataDumper 
 is used to dump crawled segments to files in CBOR format. To read those 
 dumped files back, it would be great if Tika could parse CBOR directly. The 
 catch is that CommonCrawlDataDumper does not write the correct extension; it 
 follows its own rule, and the default extension of a dumped file is .html, so 
 it would be much less painful if Tika could detect and parse those files 
 without any pre-processing steps.
 CommonCrawlDataDumper calls the following to dump CBOR:
 import com.fasterxml.jackson.dataformat.cbor.CBORFactory;
 import com.fasterxml.jackson.dataformat.cbor.CBORGenerator;
 FasterXML Jackson is a third-party library for converting JSON to CBOR and 
 vice versa.
 According to RFC 7049 (http://tools.ietf.org/html/rfc7049), CBOR does not 
 have a mandatory magic number by which other applications can 
 detect/identify it (PFA: rfc_cbor.jpg).
 It seems that, for now, the only ways to inform other applications of the 
 type are the file extension (i.e. .cbor) or content-based detection (e.g. 
 byte-histogram distribution estimation).
 Another point worth attention: it looks like Tika has already attempted to 
 add CBOR MIME detection in tika-mimetypes.xml 
 (PFA: cbor_tika.mimetypes.xml.jpg); however, this detection does not work 
 with the CBOR files dumped by CommonCrawlDataDumper.
 According to http://tools.ietf.org/html/rfc7049#section-2.4.5, there is a 
 self-describing tag, 55799 (serialized as the hex bytes 0xd9d9f7), that can 
 be used for CBOR type identification, but it is up to the application to emit 
 this tag, and it appears that the FasterXML encoder used by the Nutch dumping 
 tool omits it. An example CBOR file dumped by the Nutch tool, 
 CommonCrawlDataDumper, is attached (PFA: 142440269.html).
 The following is cited from the RFC: "...a decoder might be able to parse 
 both CBOR and JSON. Such a decoder would need to mechanically distinguish the 
 two formats. An easy way for an encoder to help the decoder would be to tag 
 the entire CBOR item with tag 55799, the serialization of which will never be 
 found at the beginning of a JSON text..."
 It also looks like a dumped file can contain two parts/sections, i.e. a 
 plain-text part and the CBOR-encoded JSON; this is worth attention when 
 parsing and identifying the type.
 On the other hand, it is worth noting that an entry for CBOR extension 
 detection needs to be added to tika-mimetypes.xml as well, e.g.
 <glob pattern="*.cbor"/>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (TIKA-1517) MIME type selection with probability

2015-04-01 Thread Luke sh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391476#comment-14391476
 ] 

Luke sh edited comment on TIKA-1517 at 4/2/15 1:30 AM:
---

After some research, it looks like the probabilistic MIME type selection 
algorithm causes some confusion when compared with Naive Bayes. The idea is 
borrowed from Naive Bayes, but it turns out that giving up some of its 
properties, while saving a bit of computation, makes the design confusing and 
less intuitive.


One of the problems I have been considering: when the magic test fails to 
determine a type, it returns a byte stream (application/octet-stream). The 
original design treats that byte-stream result as a decision, so when, for 
example, the extension test returns the correct type (e.g. GRB) but the magic 
test returns byte-stream (meaning it failed to detect the type), the question 
is whether we should count byte-stream among the decisions. After thinking 
about this for some time this week, I have decided to ignore a byte-stream 
prediction when taking the vote on which type should be used.
e.g. 
magic test: Byte-Stream (failed to detect the type)
extension test: GRB
meta hint: none (Byte-Stream)
In this case the original design returns Byte-Stream as the final type 
decision when the magic test trust values (i.e. the presumed conditional 
probabilities) are set high.
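The "ignore byte-stream when voting" rule described above could look roughly like this. The class and method names are hypothetical, not Tika's real selector: a detector that answers application/octet-stream has effectively abstained, so its vote is dropped, and octet-stream is only returned when every detector abstains.

```java
// Hypothetical sketch of the voting tweak described above (not Tika's actual
// selector class): octet-stream answers are treated as abstentions.
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TypeVote {
    static final String OCTET_STREAM = "application/octet-stream";

    // Majority vote over candidate types, skipping octet-stream abstentions;
    // falls back to octet-stream only when every detector abstained.
    static String vote(List<String> candidates) {
        Map<String, Integer> counts = new HashMap<>();
        for (String type : candidates) {
            if (!OCTET_STREAM.equals(type)) {
                counts.merge(type, 1, Integer::sum);
            }
        }
        return counts.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(OCTET_STREAM);
    }
}
```

With the example above (magic = Byte-Stream, extension = GRB, meta = Byte-Stream), this vote returns the GRB type instead of letting the highly trusted but failed magic test drag the decision to Byte-Stream.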

Secondly, I now tend to think that giving up the prior is not a good idea: it 
simplifies the computation a bit, but at the cost of confusion and intuition. 
The intuition is that we can treat a detected type as a cause with a prior of 
50% correctness, i.e. a 50% chance that the detected type is correct; this 
seems more intuitive than ignoring the prior completely.

It seems there is still room for the algorithm to be improved and optimized. 
The original design is intertwined with computational considerations, which 
causes some confusion; the causal reasoning and intuition matter more. I will 
also be optimizing and correcting some factors that seem less appropriate in 
the original design.

I am working on the improvement and researching the pros and cons; if you 
have any thoughts, please kindly let me know.



 MIME type selection with probability
 

 Key: 

[jira] [Commented] (TIKA-1517) MIME type selection with probability

2015-04-01 Thread Luke sh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391476#comment-14391476
 ] 

Luke sh commented on TIKA-1517:
---


 MIME type selection with probability
 

 Key: TIKA-1517
 URL: https://issues.apache.org/jira/browse/TIKA-1517
 Project: Tika
  Issue Type: Improvement
  Components: mime
Affects Versions: 0.1-incubating, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 
 0.10, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6
Reporter: Luke sh
 Attachments: BaysianTest.java


 Improvement and intuition
 The original implementation of MIME type selection/detection is somewhat 
 inflexible by design, as it relies heavily on the outcome of magic-byte MIME 
 type identification; thus, if magic bytes are applicable to a file, Tika 
 follows the type detected by magic bytes. It may be better to give users 
 more control over the method of choice.
 The proposed approach incorporates the Bayesian probability theorem: users 
 assign a weight, in terms of probability, to each identification method, so 
 they control their preference among the MIME type identification methods 
 available in Tika. Currently Tika has three such methods (magic bytes, file 
 extension, and the metadata content-type hint). With these weights, users 
 can choose which method they trust most (the magic-byte method is usually 
 the most trustworthy). The virtue is that in some situations file type 
 identification must be conservative: some users may want all of the 
 identification methods to agree on the same type before they start 
 processing a file, because an incorrect identification is intolerable. The 
 current implementation is less flexible for this purpose and relies heavily 
 on magic-byte identification (even though magic bytes are the most reliable 
 of the three).
 Proposed design:
 The idea is to attach a probability weight to each MIME type identification 
 method currently implemented in Tika (magic bytes, file extension match, and 
 metadata content-type hint).
 For example, as a user, I would like to assign preferences to the methods 
 according to my degree of trust, and rank the results when they do not 
 coincide.
 Bayes' rule fits this intuition well.
 The following are needed for a Bayesian implementation:
  Prior probability 
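The Bayesian combination sketched in this thread might look as follows. This is a hedged illustration with hypothetical names and numbers, not Tika's implementation: each method that votes for a type contributes a likelihood ratio of its trust value (probability it names the type when the type is correct) to its false-positive rate, and the odds start from the prior probability discussed above (e.g. 50%).

```java
// Hypothetical sketch of combining a prior with per-method trust values via
// Bayes' rule in odds form (illustrative numbers; not Tika's implementation).
public class BayesCombine {

    // prior:      P(type is correct) before any evidence, e.g. 0.5.
    // trusts[i]:  P(method i names the type | type is correct).
    // falsePos[i]: P(method i names the type | type is wrong).
    static double posterior(double prior, double[] trusts, double[] falsePos) {
        double odds = prior / (1 - prior);
        for (int i = 0; i < trusts.length; i++) {
            odds *= trusts[i] / falsePos[i];  // Bayes factor of method i's vote
        }
        return odds / (1 + odds);  // back from odds to probability
    }
}
```

For example, a 0.5 prior combined with one agreeing method of trust 0.9 and false-positive rate 0.1 yields a posterior of 0.9; each further agreeing method pushes the posterior higher, which matches the intuition that methods coinciding should increase confidence.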

[jira] [Comment Edited] (TIKA-1517) MIME type selection with probability

2015-04-01 Thread Luke sh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391476#comment-14391476
 ] 

Luke sh edited comment on TIKA-1517 at 4/1/15 10:38 PM:



[jira] [Comment Edited] (TIKA-1517) MIME type selection with probability

2015-04-01 Thread Luke sh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391476#comment-14391476
 ] 

Luke sh edited comment on TIKA-1517 at 4/1/15 10:37 PM:


After some research, it looks like the algorithm design with probabilistic mime 
type selection seems to cause some confusion vs Naive Bayesian, the idea is 
borrowed from Naive Bayesian, but it turns out giving up on some properties of 
Naive Bayesian seems to cause some confusions and therefore less intuitive 
although some computations are skimped.


One of the problems i have been considering is that when magic test fails to 
determine a type, it returns byte-stream, in the original design, the 
original design approaches this by taking byte-stream as a decision, so e.g. 
when the extension test returns a correct type (e.g. GRB) but the magic test 
returns the byte-stream meaning it fails to detect the type, the question is 
should we take byte-stream as part of the decisions. After thinking about this 
this week for some time, i decide to ignore this byte-stream predicted by a 
test when taking vote for decision on which type should be used.
e.g. 
magic test : Byte-Stream
extension test: GRB
meta hint: none(Byte-Stream)
The original design of the system is expected to return Byte-Stream as the 
final decision for the type detection when magic test trust values(i.e. the 
presumed conditional probabilities) are set to high.

Secondly, after some thought, I now believe giving up on the prior might not be a good idea: it simplifies the computation a little, but it causes confusion and makes the design less intuitive. The intuition is that we can treat a detected type as a cause whose prior is 50%, i.e. a 50% chance that the detected type is correct; that seems more intuitive than ignoring the prior completely.
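As a small illustration of the 50% prior idea: with Bayes' rule, the 0.5 prior combines with a test's trust value to give a posterior. The 0.75 trust value matches the TIKA-1517 defaults; the false-positive probability 0.05 is an assumed number for the sketch, not one specified in the issue.

```python
def posterior(prior: float, p_pos_given_type: float, p_pos_given_not: float) -> float:
    """Bayes' rule: P(type | positive test result)."""
    evidence = p_pos_given_type * prior + p_pos_given_not * (1.0 - prior)
    return p_pos_given_type * prior / evidence

# Magic test reports "pdf"; treat the detected type as a cause with a 50%
# prior and the magic test's trust value of 0.75 (assumed FP rate 0.05):
p = posterior(0.5, 0.75, 0.05)
# p == 0.9375
```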

Having considered the problem for a while, there is still room for correction and optimization. The original design intertwines the causal reasoning with computational shortcuts, which causes some confusion; the causal reasoning and intuition are the more important part, so I will also be correcting and optimizing some of the factors that seem less appropriate in the original design.

I am working on the improvement; if you have any thoughts, please let me know.


   


was (Author: lukeliush):
After some research, it looks like the algorithm design with probabilistic mime 
type selection seems to be cause some confusion vs Naive Bayesian, the idea is 
borrowed from Naive Bayesian, but it turns out giving up on some properties of 
Naive Bayesian seems to cause some confusions and therefore less intuitive 
although some computations are skimped.


One of the problems i have been considering is that when magic test fails to 
determine a type, it returns byte-stream, in the original design, the 
original design approaches this by taking byte-stream as a decision, so e.g. 
when the extension test returns a correct type (e.g. GRB) but the magic test 
returns the byte-stream meaning it fails to detect the type, the question is 
should we take byte-stream as part of the decisions. After thinking about this 
this week for some time, i decide to ignore this byte-stream predicted by a 
test when taking vote for decision on which type should be used.
e.g. 
magic test : Byte-Stream
extension test: GRB
meta hint: none(Byte-Stream)
The original design of the system is expected to return Byte-Stream as the 
final decision for the type detection when magic test trust values(i.e. the 
presumed conditional probabilities) are set to high.

secondly, after thinking for a while, i tend to think giving up on the prior 
might not be a good idea even though it simplifies a bit the computation, but 
that causes a bit of confusion and less intuitive.
The intuition is that we probably can treat a detected type as a cause whose 
prior is 50% percent of correctness, i.e. 50% of chance that the detected type 
is correct, this seems to be more intuitive than completely ignoring the prior.

After thinking about this problem for a while, it seems there are still some 
space to be corrected and optimized.
Original design is intertwined with the consideration on computations which 
seems to cause some confusion, but actually the causal reasoning and intuition 
might be a bit more important, i will also be optimizing and correcting some of 
the factors which seems to be less appropriate in the original design.

I am working on the improvement; if any thoughts, please kindly let me know.


   

 MIME type selection with probability
 

 Key: TIKA-1517
 URL: https://issues.apache.org/jira/browse/TIKA-1517
 Project: Tika
  Issue Type: Improvement
  Components: mime

[jira] [Updated] (TIKA-1517) MIME type selection with probability

2015-04-01 Thread Luke sh (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke sh updated TIKA-1517:
--
Priority: Trivial  (was: Major)

 MIME type selection with probability
 

 Key: TIKA-1517
 URL: https://issues.apache.org/jira/browse/TIKA-1517
 Project: Tika
  Issue Type: Improvement
  Components: mime
Affects Versions: 0.1-incubating, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 
 0.10, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6
Reporter: Luke sh
Priority: Trivial
 Attachments: BaysianTest.java


 Improvement and intuition
 The original implementation of MIME type selection/detection is somewhat inflexible by design, because it relies heavily on the outcome of magic-bytes MIME type identification: if a magic-bytes match applies to a file, Tika follows the type detected by magic-bytes. It would be better to give users more control over the method of choice.
 The proposed approach borrows from the Bayesian probability theorem: users assign a weight, in terms of a probability, to each identification method, giving them control over their preference among the three methods currently implemented in Tika (magic bytes, file extension, and the metadata content-type hint). With these weights, users can choose the method they trust most; the magic-bytes method is usually the most trustworthy. The virtue is that some situations require sensitive file type identification: a user may want all of the identification methods to agree on the same type before processing a file, because an incorrect identification is intolerable there. The current implementation is less flexible for this purpose and relies heavily on magic-bytes identification (even though magic bytes are the most reliable of the three).
 Proposed design:
 The idea is to incorporate probabilities as weights on each MIME type identification method currently implemented in Tika (the magic-bytes approach, the file extension match, and the metadata content-type hint).
 For example, as a user I would like to assign a preference to each method according to the degree of trust I place in it, and order the results when they do not coincide. Bayes' rule is appropriate here and matches that intuition. A Bayesian implementation needs the following:
  Prior probability P(file_type), e.g. P(pdf). Theoretically this is computed from samples and depends on the domain or use case; intuitively we care more about the order of the weighted results than the actual numbers. For example, if we crawl a website that contains mostly PDF files, we can collect samples and compute the prior; if 90% of the documents are PDFs, the prior is P(pdf) = 0.9. Here we propose to expose the prior as a configurable parameter and leave it unapplied by default. Alternatively, we could define the prior of each file type as 1/[number of supported file types in Tika], approximately 1/1157, which seems fairer; but that prior is fixed and identical for every type, and since we care mostly about the order of the results, a constant factor cannot change the order. Bringing 1/1157 into the Bayesian equation would not affect the ranking and would only burden the implementation with extra computation, so we leave the prior unapplied, i.e. we treat it as 1, as if it were not there. Note again that we care about the order rather than the actual numbers, and keeping the parameter configurable provides flexibility for particular use cases.
  Conditional probability of a positive test given a file type, P(test | file_type), e.g. P(test1 = pdf | pdf). This probability is also based on samples and the domain or use case, so we leave it configurable. Based on our intuition, test1 (the magic-bytes method) is the most trustworthy, so the default value of P(test1 = a_file_type | a_file_type) is 0.75: given a file of a certain type, the probability that test1 predicts that type is 0.75. Next we propose 0.7 for test3 and 0.65 for test2;
 (note again, test1 = magic-bytes, test2 
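
The scheme above (prior treated as 1, per-test trust values 0.75 / 0.65 / 0.70) can be sketched as a naive-Bayes-style score. The false-positive probability is an assumed number, since the text is truncated before it is specified, and the function names are illustrative rather than Tika's API:

```python
# Default trust values from the description: test1 = magic bytes (0.75),
# test2 = file extension (0.65), test3 = metadata content-type hint (0.70).
TRUST = {"magic": 0.75, "extension": 0.65, "meta": 0.70}
FALSE_POSITIVE = 0.05  # assumed; not specified in the (truncated) text

def score(candidate: str, results: dict[str, str]) -> float:
    """Unnormalized naive-Bayes score for a candidate type (prior taken as 1)."""
    s = 1.0
    for test, detected in results.items():
        s *= TRUST[test] if detected == candidate else FALSE_POSITIVE
    return s

results = {"magic": "pdf", "extension": "pdf", "meta": "html"}
ranked = sorted({"pdf", "html"}, key=lambda t: score(t, results), reverse=True)
# pdf scores 0.75*0.65*0.05; html scores 0.05*0.05*0.70 → "pdf" ranks first
```

Because only the ranking matters, dropping the constant prior leaves the order unchanged, which is exactly the argument made for leaving it unapplied.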

[jira] [Updated] (TIKA-1582) Mime Detection based on neural networks with Byte-frequency-histogram

2015-03-28 Thread Luke sh (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke sh updated TIKA-1582:
--
Attachment: nnmodel.docx

Documentation 

 Mime Detection based on neural networks with Byte-frequency-histogram 
 --

 Key: TIKA-1582
 URL: https://issues.apache.org/jira/browse/TIKA-1582
 Project: Tika
  Issue Type: Improvement
  Components: detector, mime
Affects Versions: 1.7
Reporter: Luke sh
Priority: Trivial
 Attachments: nnmodel.docx


 Content-based MIME type detection is one of the popular approaches to detecting MIME types; other approaches are based on file extensions and magic numbers. Tika currently implements three approaches:
 1) file extensions
 2) magic numbers (the most trustworthy in Tika)
 3) content-type (the header in the HTTP response, if present and available)
 Content-based detection, by contrast, analyses the byte distribution of the entire stream, finds the pattern that files of the same type share, and builds a function that groups files into one or several classes in order to classify and predict. This feature might broaden the usage of Tika by adding some security enforcement to MIME detection: because the model is etched with the patterns it has seen, in some situations we may refuse to trust types the model has not been trained on. Magic numbers embedded in a file can be copied while the actual content is a potentially harmful Trojan program; by putting trust in byte-frequency patterns instead, we can enhance the security of the detection.
 The proposed content-based detection to be integrated into Tika is based on a machine learning algorithm: a neural network trained with back-propagation.
 The input: 256 bins (0-255), one per byte value, each storing that byte's count of occurrences. The byte-frequency histograms are normalized to fall in the range 0 to 1 and then passed through a companding function that boosts the infrequent bytes.
 The output of the neural network is a binary decision, 1 or 0.
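The feature extraction described above can be sketched as follows. This is a minimal illustration, not the actual R training code; the exact companding function is not specified here, so a power-law compander (x ** beta with beta < 1) is assumed as one common choice for lifting the infrequent bytes:

```python
def byte_frequency_histogram(data: bytes, beta: float = 0.5) -> list[float]:
    """Return a 256-bin companded, normalized byte-frequency histogram."""
    counts = [0] * 256
    for b in data:                   # one bin per byte value 0-255
        counts[b] += 1
    peak = max(counts) or 1          # normalize so values fall in [0, 1]
    normalized = [c / peak for c in counts]
    return [x ** beta for x in normalized]  # compand: boost small values

# GRIB files begin with the ASCII magic "GRIB"; any byte string works here.
hist = byte_frequency_histogram(b"GRIB sample bytes")
assert len(hist) == 256 and all(0.0 <= x <= 1.0 for x in hist)
```

The resulting 256-value vector is what would be fed to the network's input layer.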
 Note that the proposed feature will be implemented with the GRB file type as one example. In this example we build a model that separates GRB files from non-GRB files; the set of non-GRB files is huge and cannot be easily defined, so we need as many negative training examples as possible to form the non-GRB decision boundary.
 The neural network involves two stages: training and classification. The training can be done in any programming language; in this research it is implemented in R, and the source can be found in my GitHub repository, i.e. https://github.com/LukeLiush/filetypeDetection. I am also going to post a document describing the use of the program and the syntax/format of its input and output.
 After training, we export the model and import it into Tika: we create a TrainedModelDetector that reads one or more model files containing the model parameters, so it can detect the MIME types covered by those models. Details of the research and usage of this proposed feature will be posted on my GitHub shortly.
 It is worth noting again that in this research we only worked out one model, GRB, as an example to demonstrate content-based MIME detection. One of the challenges, again, is that the non-GRB file types cannot be clearly defined unless we feed the model example data for every existing file type in the world, which seems too utopian; it is better to fix and define the set of classes/types in advance to narrow the problem domain.
 Another challenge is the size of the training data: even if we know the types we want to classify, getting enough training data to form a model is also a main factor of success. In our example model, GRB data were collected from ftp://hydro1.sci.gsfc.nasa.gov/data/, and we found that the GRB data from that source all exhibit a similar pattern; a simple neural network structure predicts well, and even a linear logistic regression does a good job. However, if we pass GRB files collected from other sources to the model, it predicts poorly and unexpectedly. This brings up the question of whether we need to include all training data or only what is of interest; including all data is very expensive, so it is necessary to introduce some domain knowledge 
 to 

[jira] [Updated] (TIKA-1582) Mime Detection based on neural networks with Byte-frequency-histogram

2015-03-28 Thread Luke sh (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke sh updated TIKA-1582:
--
Attachment: week2-report-histogram comparison.docx

histogram comparison


[jira] [Updated] (TIKA-1582) Mime Detection based on neural networks with Byte-frequency-histogram

2015-03-28 Thread Luke sh (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke sh updated TIKA-1582:
--
Attachment: week6 report.docx

Test report 


[jira] [Created] (TIKA-1582) Mime Detection based on neural networks with Byte-frequency-histogram

2015-03-26 Thread Luke sh (JIRA)
Luke sh created TIKA-1582:
-

 Summary: Mime Detection based on neural networks with 
Byte-frequency-histogram 
 Key: TIKA-1582
 URL: https://issues.apache.org/jira/browse/TIKA-1582
 Project: Tika
  Issue Type: Improvement
  Components: detector, mime
Affects Versions: 1.7
Reporter: Luke sh
Priority: Trivial



[jira] [Commented] (TIKA-1582) Mime Detection based on neural networks with Byte-frequency-histogram

2015-03-26 Thread Luke sh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14382590#comment-14382590
 ] 

Luke sh commented on TIKA-1582:
---

a pull request with this feature for Tika will be created shortly.


[jira] [Updated] (TIKA-1582) Mime Detection based on neural networks with Byte-frequency-histogram

2015-03-26 Thread Luke sh (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke sh updated TIKA-1582:
--
Description: 
Another challenge is the size of the training data. Even if we know the types we 
want to classify, getting enough training data to form a model is one of the 
main factors of success. In our example model, GRB data were collected from 
ftp://hydro1.sci.gsfc.nasa.gov/data/, and we found that the GRB data from that 
source all exhibit a similar pattern: a simple neural network structure predicts 
well, and even a linear logistic regression does a good job. However, if we pass 
GRB files collected from other sources to the model, it predicts poorly and 
unexpectedly. This raises the question of whether we need to include all 
training data or only the data of interest; including all data is very 
expensive, so it is necessary to introduce some domain knowledge to minimize the 
problem domain. We believe users should know what types they want to classify 
and should be able to get enough training data, although collecting it can be a 
tedious and expensive process. Again, it is better to have that domain 
knowledge, with the set of types present in the users' database, and to train a 
model on examples of every type in the database.


[jira] [Updated] (TIKA-1561) GCMD Directory Interchange Format (.dif) identification

2015-02-25 Thread Luke sh (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke sh updated TIKA-1561:
--
Description: 
Cited from http://gcmd.nasa.gov/add/difguide/WRITEADIF.pdf:
"The Directory Interchange Format (DIF) is a metadata format used to create 
directory entries that describe scientific data sets. A DIF holds a collection 
of fields, which detail specific information about the data."
A .dif file is a well-formed XML document describing a scientific data set; the 
XSD schema it follows is referenced inside the .dif XML file itself, e.g. 
http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/dif_v9.8.4.xsd

The reason for opening this ticket is that a Tika parser for this DIF file 
format is under consideration/development, so support for identifying the file 
type is needed. Although a .dif file is an XML file that can be parsed properly 
by the XML parser, some of its fields may still need specific processing to be 
extracted and injected into the system for analysis.
It is therefore proposed that the following type, 'text/dif+xml', be appended to 
tika-mimetypes.xml as a subclass of application/xml, so that special processing 
can be applied to this particular kind of XML file.

<mime-type type="text/dif+xml">
   <root-XML localName="DIF"/>
   <root-XML localName="DIF" namespaceURI="http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/"/>
   <glob pattern="*.dif"/>
   <sub-class-of type="application/xml"/>
</mime-type>


Expected MIME type: text/dif+xml
The following is the link to the dif format guide
http://gcmd.nasa.gov/add/difguide/


example dif files:
1) https://www.aoncadis.org/dataset/id/005f3222-7548-11e2-851e-00c0f03d5b7c.dif
2) https://www.aoncadis.org/dataset/id/0091cf0c-7ad3-11e2-851e-00c0f03d5b7c.dif
3) https://www.aoncadis.org/dataset/id/02a6301c-3ab3-11e4-8ee7-00c0f03d5b7c.dif

an example dif file has also been attached.
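The root-XML rules above match on the root element's local name and namespace. A 
minimal stdlib sketch of that check is below; the DIF fragment in main is 
invented for illustration, not a real record:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import java.io.ByteArrayInputStream;

public class DifRootCheck {
    /** Return true if the XML's root element is DIF in the GCMD namespace,
     *  mirroring what the root-XML rule matches on. */
    static boolean looksLikeDif(String xml) throws Exception {
        DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
        f.setNamespaceAware(true);  // required so getLocalName()/getNamespaceURI() work
        Element root = f.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")))
                .getDocumentElement();
        return "DIF".equals(root.getLocalName())
                && "http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/".equals(root.getNamespaceURI());
    }

    public static void main(String[] args) throws Exception {
        String sample = "<DIF xmlns=\"http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/\">"
                + "<Entry_ID>example</Entry_ID></DIF>";
        System.out.println(looksLikeDif(sample));
    }
}
```

Tika performs this kind of match itself once the mime-type entry is registered; 
the sketch only illustrates why a root-element rule is enough to tell a DIF 
document apart from generic XML.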



 GCMD Directory Interchange Format (.dif) identification
 ---

 Key: TIKA-1561
 URL: https://issues.apache.org/jira/browse/TIKA-1561
 Project: Tika
  Issue Type: Improvement
  Components: mime
Affects Versions: 1.7
Reporter: Luke sh
Priority: Trivial
 Attachments: 
 carbon_isotopic_values_of_alkanes_extracted_from_paleosols.dif



[jira] [Updated] (TIKA-1561) GCMD Directory Interchange Format (.dif) identification

2015-02-25 Thread Luke sh (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke sh updated TIKA-1561:
--
Attachment: (was: 
carbon_isotopic_values_of_alkanes_extracted_from_paleosols.dif)

 GCMD Directory Interchange Format (.dif) identification
 ---

 Key: TIKA-1561
 URL: https://issues.apache.org/jira/browse/TIKA-1561
 Project: Tika
  Issue Type: Improvement
  Components: mime
Affects Versions: 1.7
Reporter: Luke sh
Priority: Trivial

 Cited from http://gcmd.nasa.gov/add/difguide/WRITEADIF.pdf: The Directory 
 Interchange Format (DIF) is a metadata format used to create directory 
 entries that describe scientific data sets. A DIF holds a collection of 
 fields, which detail specific information about the data.
 The .dif file follows proper XML format describing the scientific data set; 
 the XSD schema files are referenced inside the .dif XML file, e.g. 
 http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/dif_v9.8.4.xsd
 The reason for opening this ticket is that a Tika parser for this DIF file 
 type is under development, so support for identifying the file is needed.
 Although the DIF file in this case appears to be an XML file that can be 
 parsed properly by the XMLParser, it may still need specific processing on 
 some of the fields to be extracted and injected into the system for analysis.
 It was therefore decided that the type 'text/dif+xml', which extends 
 application/xml, be used so that special processing can be applied to this 
 particular kind of XML file.
 <mime-type type="text/dif+xml">
   <root-XML localName="DIF"/>
   <root-XML localName="DIF" namespaceURI="http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/"/>
   <glob pattern="*.dif"/>
   <sub-class-of type="application/xml"/>
 </mime-type>
 Expected MIME type: text/dif+xml
 The following is the link to the dif format guide
 http://gcmd.nasa.gov/add/difguide/
 example dif files:
 1) 
 https://www.aoncadis.org/dataset/id/005f3222-7548-11e2-851e-00c0f03d5b7c.dif
 2) 
 https://www.aoncadis.org/dataset/id/0091cf0c-7ad3-11e2-851e-00c0f03d5b7c.dif
 3) 
 https://www.aoncadis.org/dataset/id/02a6301c-3ab3-11e4-8ee7-00c0f03d5b7c.dif
 an example dif file has also been attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Issue Comment Deleted] (TIKA-1561) GCMD Directory Interchange Format (.dif) identification

2015-02-25 Thread Luke sh (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke sh updated TIKA-1561:
--
Comment: was deleted

(was: sample dif file)



[jira] [Created] (TIKA-1561) GCMD Directory Interchange Format (.dif) identification

2015-02-25 Thread Luke sh (JIRA)
Luke sh created TIKA-1561:
-

 Summary: GCMD Directory Interchange Format (.dif) identification
 Key: TIKA-1561
 URL: https://issues.apache.org/jira/browse/TIKA-1561
 Project: Tika
  Issue Type: Improvement
  Components: mime
Affects Versions: 1.7
Reporter: Luke sh
Priority: Trivial


Cited from http://gcmd.nasa.gov/add/difguide/WRITEADIF.pdf: The Directory 
Interchange Format (DIF) is a metadata format used to create directory 
entries that describe scientific data sets. A DIF holds a collection of 
fields, which detail specific information about the data.
The .dif file follows proper XML format describing the scientific data set; 
the XSD schema files are referenced inside the .dif XML file, e.g. 
http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/dif_v9.8.4.xsd

The reason for opening this ticket is that a Tika parser for this DIF file 
type is under development, so support for identifying the file is needed.
Although the DIF file in this case appears to be an XML file that can be 
parsed properly by the XMLParser, it may still need specific processing on 
some of the fields to be extracted and injected into the system for analysis.
It was therefore decided that the type 'text/dif+xml', which extends 
application/xml, be used so that special processing can be applied to this 
particular kind of XML file.

<mime-type type="text/dif+xml">
   <root-XML localName="DIF"/>
   <root-XML localName="DIF" namespaceURI="http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/"/>
   <glob pattern="*.dif"/>
   <sub-class-of type="application/xml"/>
</mime-type>


Expected MIME type: text/dif+xml
The following is the link to the dif format guide
http://gcmd.nasa.gov/add/difguide/
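The root-XML rules in the entry above can be sketched outside of Tika as well. The following is a minimal Python illustration (not Tika's actual detector code) of deciding between text/dif+xml and plain application/xml by inspecting the root element's local name and namespace, mirroring the two root-XML rules:

```python
import xml.etree.ElementTree as ET

# Namespace from the proposed root-XML rule; the function name is illustrative.
DIF_NS = "http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/"

def detect_dif(xml_bytes):
    """Return 'text/dif+xml' if the root element is DIF (bare, or in the
    GCMD namespace), otherwise fall back to 'application/xml'."""
    root = ET.fromstring(xml_bytes)
    # ElementTree renders namespaced tags as '{namespace}localName'.
    if root.tag.startswith("{"):
        ns, local = root.tag[1:].split("}", 1)
        if local == "DIF" and ns == DIF_NS:
            return "text/dif+xml"
    elif root.tag == "DIF":
        return "text/dif+xml"
    return "application/xml"
```

In Tika itself this logic is driven by the tika-mimetypes.xml entry; the sketch is only meant to show why two root-XML rules (with and without namespaceURI) are needed.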



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-1561) GCMD Directory Interchange Format (.dif) identification

2015-02-25 Thread Luke sh (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke sh updated TIKA-1561:
--
Attachment: carbon_isotopic_values_of_alkanes_extracted_from_paleosols.dif

sample dif file



[jira] [Updated] (TIKA-1561) GCMD Directory Interchange Format (.dif) identification

2015-02-25 Thread Luke sh (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke sh updated TIKA-1561:
--
Description: 
Cited from http://gcmd.nasa.gov/add/difguide/WRITEADIF.pdf: The Directory 
Interchange Format (DIF) is a metadata format used to create directory 
entries that describe scientific data sets. A DIF holds a collection of 
fields, which detail specific information about the data.
The .dif file follows proper XML format describing the scientific data set; 
the XSD schema files are referenced inside the .dif XML file, e.g. 
http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/dif_v9.8.4.xsd

The reason for opening this ticket is that a Tika parser for this DIF file 
type is under development, so support for identifying the file is needed.
Although the DIF file in this case appears to be an XML file that can be 
parsed properly by the XMLParser, it may still need specific processing on 
some of the fields to be extracted and injected into the system for analysis.
It was therefore decided that the type 'text/dif+xml', which extends 
application/xml, be used so that special processing can be applied to this 
particular kind of XML file.

<mime-type type="text/dif+xml">
   <root-XML localName="DIF"/>
   <root-XML localName="DIF" namespaceURI="http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/"/>
   <glob pattern="*.dif"/>
   <sub-class-of type="application/xml"/>
</mime-type>


Expected MIME type: text/dif+xml
The following is the link to the dif format guide
http://gcmd.nasa.gov/add/difguide/


example dif files:
1) https://www.aoncadis.org/dataset/id/005f3222-7548-11e2-851e-00c0f03d5b7c.dif
2) https://www.aoncadis.org/dataset/id/0091cf0c-7ad3-11e2-851e-00c0f03d5b7c.dif
3) https://www.aoncadis.org/dataset/id/02a6301c-3ab3-11e4-8ee7-00c0f03d5b7c.dif

an example dif file has also been attached.

  was:
cited from the http://gcmd.nasa.gov/add/difguide/WRITEADIF.pdf 
The Directory Interchange Format (DIF) is metadata format used to create 
directory entries that describe scientific data
sets. A DIF holds a collection of fields, which detail specific information 
about the data.
 The .dif file respect proper xml format that describe the scientific data set, 
the schema xsd files can be found inside the .dif xml file.
i,e, http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/dif_v9.8.4.xsd

The reason opening this ticket is tika parser for this dif file is being under 
consideration with development, the support to identify the file is needed.
Although dif file in this case seems to be an xml file which can be parsed 
properly by xmlparser, still it might need a specific process on some of the 
fields to be extracted and injected into the System for analysis.
Then it is decided that the following type 'text/dif+xml' is used that extends 
the application/xml, so that we can apply some special process to this 
particular xml file.

<mime-type type="text/dif+xml">
   <root-XML localName="DIF"/>
   <root-XML localName="DIF" namespaceURI="http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/"/>
   <glob pattern="*.dif"/>
   <sub-class-of type="application/xml"/>
</mime-type>


Expected MIME type: text/dif+xml
The following is the link to the dif format guide
http://gcmd.nasa.gov/add/difguide/



[jira] [Commented] (TIKA-1561) GCMD Directory Interchange Format (.dif) identification

2015-02-25 Thread Luke sh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337774#comment-14337774
 ] 

Luke sh commented on TIKA-1561:
---

I am going to send a pull request with this DIF type identification; work in 
progress.



[jira] [Updated] (TIKA-1561) GCMD Directory Interchange Format (.dif) identification

2015-02-25 Thread Luke sh (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke sh updated TIKA-1561:
--
Attachment: carbon_isotopic_values_of_alkanes_extracted_from_paleosols.dif

sample dif file



[jira] [Updated] (TIKA-1561) GCMD Directory Interchange Format (.dif) identification

2015-02-25 Thread Luke sh (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke sh updated TIKA-1561:
--
Description: 
Cited from http://gcmd.nasa.gov/add/difguide/WRITEADIF.pdf: The Directory 
Interchange Format (DIF) is a metadata format used to create directory 
entries that describe scientific data sets. A DIF holds a collection of 
fields, which detail specific information about the data.
The .dif file follows proper XML format describing the scientific data set; 
the XSD schema files are referenced inside the .dif XML file, e.g. 
http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/dif_v9.8.4.xsd

The reason for opening this ticket is that a Tika parser for this DIF file 
is under development, so support for identifying this type of XML file is 
needed.
Although the DIF file in this case appears to be a proper XML file that can 
be parsed by the XMLParser, it may still need specific processing on some of 
the fields to be extracted and injected into the Solr system for analysis.
It is therefore proposed that the type 'text/dif+xml', which extends 
application/xml, be appended to tika-mimetypes.xml so that special 
processing can be applied to this particular XML file.

<mime-type type="text/dif+xml">
   <root-XML localName="DIF"/>
   <root-XML localName="DIF" namespaceURI="http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/"/>
   <glob pattern="*.dif"/>
   <sub-class-of type="application/xml"/>
</mime-type>


Expected MIME type: text/dif+xml
The following is the link to the dif format guide
http://gcmd.nasa.gov/add/difguide/


example dif files:
1) https://www.aoncadis.org/dataset/id/005f3222-7548-11e2-851e-00c0f03d5b7c.dif
2) https://www.aoncadis.org/dataset/id/0091cf0c-7ad3-11e2-851e-00c0f03d5b7c.dif
3) https://www.aoncadis.org/dataset/id/02a6301c-3ab3-11e4-8ee7-00c0f03d5b7c.dif

an example dif file has also been attached.

  was:
cited from the http://gcmd.nasa.gov/add/difguide/WRITEADIF.pdf 
The Directory Interchange Format (DIF) is metadata format used to create 
directory entries that describe scientific data
sets. A DIF holds a collection of fields, which detail specific information 
about the data.
 The .dif file respect proper xml format that describe the scientific data set, 
the schema xsd files can be found inside the .dif xml file.
i,e, http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/dif_v9.8.4.xsd

The reason opening this ticket is tika parser for this dif file is being under 
consideration with development, the support to identify the file is needed.
Although dif file in this case seems to be an xml file which can be parsed 
properly by xmlparser, still it might need a specific process on some of the 
fields to be extracted and injected into the System for analysis.
Then it is proposed that the following type 'text/dif+xml' is appended in the 
tika-mimetypes.xml that extends the application/xml, so that some special 
process can be applied to this particular xml file.

<mime-type type="text/dif+xml">
   <root-XML localName="DIF"/>
   <root-XML localName="DIF" namespaceURI="http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/"/>
   <glob pattern="*.dif"/>
   <sub-class-of type="application/xml"/>
</mime-type>


Expected MIME type: text/dif+xml
The following is the link to the dif format guide
http://gcmd.nasa.gov/add/difguide/


example dif files:
1) https://www.aoncadis.org/dataset/id/005f3222-7548-11e2-851e-00c0f03d5b7c.dif
2) https://www.aoncadis.org/dataset/id/0091cf0c-7ad3-11e2-851e-00c0f03d5b7c.dif
3) https://www.aoncadis.org/dataset/id/02a6301c-3ab3-11e4-8ee7-00c0f03d5b7c.dif

an example dif file has also been attached.



[jira] [Updated] (TIKA-1561) GCMD Directory Interchange Format (.dif) identification

2015-02-25 Thread Luke sh (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke sh updated TIKA-1561:
--
Description: 
Cited from http://gcmd.nasa.gov/add/difguide/WRITEADIF.pdf: The Directory 
Interchange Format (DIF) is a metadata format used to create directory 
entries that describe scientific data sets. A DIF holds a collection of 
fields, which detail specific information about the data.
The .dif file follows proper XML format describing the scientific data set; 
the XSD schema files are referenced inside the .dif XML file, e.g. 
http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/dif_v9.8.4.xsd

The reason for opening this ticket is that a Tika parser for this DIF file 
is under development, so support for identifying this type of XML file is 
needed.
Although the DIF file in this case appears to be a proper XML file that can 
be parsed by the XMLParser, it may still need specific processing on some of 
the fields to be extracted and injected into the Solr system for analysis.
It is therefore proposed that the type 'text/dif+xml', which extends 
application/xml, be appended to tika-mimetypes.xml to support this specific 
XML type detection, so that special processing can be applied to this 
particular XML file.

<mime-type type="text/dif+xml">
   <root-XML localName="DIF"/>
   <root-XML localName="DIF" namespaceURI="http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/"/>
   <glob pattern="*.dif"/>
   <sub-class-of type="application/xml"/>
</mime-type>


Expected MIME type: text/dif+xml
The following is the link to the dif format guide
http://gcmd.nasa.gov/add/difguide/


example dif files:
1) https://www.aoncadis.org/dataset/id/005f3222-7548-11e2-851e-00c0f03d5b7c.dif
2) https://www.aoncadis.org/dataset/id/0091cf0c-7ad3-11e2-851e-00c0f03d5b7c.dif
3) https://www.aoncadis.org/dataset/id/02a6301c-3ab3-11e4-8ee7-00c0f03d5b7c.dif

an example dif file has also been attached.

  was:
cited from the http://gcmd.nasa.gov/add/difguide/WRITEADIF.pdf 
The Directory Interchange Format (DIF) is metadata format used to create 
directory entries that describe scientific data
sets. A DIF holds a collection of fields, which detail specific information 
about the data.
 The .dif file respect proper xml format that describe the scientific data set, 
the schema xsd files can be found inside the .dif xml file.
i,e, http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/dif_v9.8.4.xsd

The reason opening this ticket is tika parser for this dif file is being under 
consideration with development, the support to identify the type of xml file is 
needed.
Although dif file in this case seems to be an proper xml file which can be 
parsed by xmlparser, still it might need a specific process on some of the 
fields to be extracted and injected into the Solr System for analysis.
Then it is proposed that the following type 'text/dif+xml' is appended in the 
tika-mimetypes.xml that extends the application/xml, so that some special 
process can be applied to this particular xml file.

<mime-type type="text/dif+xml">
   <root-XML localName="DIF"/>
   <root-XML localName="DIF" namespaceURI="http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/"/>
   <glob pattern="*.dif"/>
   <sub-class-of type="application/xml"/>
</mime-type>


Expected MIME type: text/dif+xml
The following is the link to the dif format guide
http://gcmd.nasa.gov/add/difguide/


example dif files:
1) https://www.aoncadis.org/dataset/id/005f3222-7548-11e2-851e-00c0f03d5b7c.dif
2) https://www.aoncadis.org/dataset/id/0091cf0c-7ad3-11e2-851e-00c0f03d5b7c.dif
3) https://www.aoncadis.org/dataset/id/02a6301c-3ab3-11e4-8ee7-00c0f03d5b7c.dif

an example dif file has also been attached.



[jira] [Commented] (TIKA-1539) GRB file magic bytes and extension matching

2015-02-05 Thread Luke sh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308212#comment-14308212
 ] 

Luke sh commented on TIKA-1539:
---

Pull request #28 adds GRB files for unit tests.
 



[jira] [Created] (TIKA-1539) GRB file magic bytes and extension matching

2015-02-03 Thread Luke sh (JIRA)
Luke sh created TIKA-1539:
-

 Summary: GRB file magic bytes and extension matching 
 Key: TIKA-1539
 URL: https://issues.apache.org/jira/browse/TIKA-1539
 Project: Tika
  Issue Type: Improvement
  Components: mime
Affects Versions: 1.7
Reporter: Luke sh


GRB type detection with magic bytes and file extensions probably needs to be 
supported in Tika; the GRB parser is under development, so it would be good 
to have magic-byte and extension matching detection for it.

However, GRB does not have a standard MIME type, so the following extension 
and magic matching settings in tika-mimetypes.xml are proposed for GRB MIME 
type identification.

<mime-type type="application/x-grib">
  <acronym>GRIB</acronym>
  <_comment>General Regularly-distributed Information in Binary form</_comment>
  <tika:link>http://en.wikipedia.org/wiki/GRIB</tika:link>
  <magic priority="50">
    <match value="GRIB" type="string" offset="0"/>
  </magic>
  <glob pattern="*.grb"/>
  <glob pattern="*.grb1"/>
  <glob pattern="*.grb2"/>
</mime-type>


Any suggestions and advice will be welcomed and appreciated.
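For illustration, the proposed glob and magic rules above amount to the following check, sketched in Python rather than Tika's Java (the function name is illustrative, not a Tika API):

```python
# ASCII 'GRIB' at offset 0, as in the proposed <match> rule.
GRIB_MAGIC = b"GRIB"

def guess_grib_type(filename, data):
    """Apply the proposed rules: 'GRIB' magic at offset 0, or a *.grb /
    *.grb1 / *.grb2 extension; either maps to application/x-grib."""
    if data[:4] == GRIB_MAGIC or filename.endswith((".grb", ".grb1", ".grb2")):
        return "application/x-grib"
    return None
```

In Tika the magic rule would take priority over the glob when both apply, since magic matches are considered more trustworthy than extensions.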



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-1535) Inheritance modification for the class MIMETypes

2015-01-29 Thread Luke sh (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke sh updated TIKA-1535:
--
Description: 
The MimeTypes class does not currently allow for inheritance.

There are a couple of methods in this class that look independent, some of 
which need to be exposed or made overridable for special needs or use cases; 
this would give Tika users more flexibility when implementing new MIME 
detection algorithms.

Perhaps it would be a good idea to extract the detector logic out of the 
MimeTypes class and create an independent detector for Tika.

  was:
The Class MIMETypes does not currently allow for inheritance.

There are a couple of methods in this class which looks independent, and some 
of which needs to be exposed or overwritten for special needs or use cases, 
this will enable tika users with more flexibility for new mime detection 
algorithm.


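One way the "extract the detector logic" idea could be shaped is sketched below in Python (not Tika's actual Java classes; all names here are illustrative). Each detection strategy lives in its own class with a common detect() method, and a composite tries them in order, so individual strategies can be subclassed or swapped without touching a monolithic MimeTypes-style class:

```python
class MagicDetector:
    """Illustrative stand-in for magic-byte detection logic extracted
    out of a monolithic MimeTypes-style class."""
    def detect(self, data, metadata):
        if data[:4] == b"GRIB":
            return "application/x-grib"
        return None

class ExtensionDetector:
    """Illustrative stand-in for glob/extension-based detection."""
    def detect(self, data, metadata):
        name = metadata.get("resourceName", "")
        if name.endswith(".grb"):
            return "application/x-grib"
        return None

class CompositeDetector:
    """First-match composition: each sub-detector can be replaced or
    overridden independently, which is the flexibility the ticket asks for."""
    def __init__(self, detectors):
        self.detectors = detectors
    def detect(self, data, metadata):
        for d in self.detectors:
            mime = d.detect(data, metadata)
            if mime is not None:
                return mime
        return "application/octet-stream"  # unknown-type fallback
```

Tika's own org.apache.tika.detect package follows a similar composite pattern; the sketch only shows why separating the detector from MimeTypes makes overriding easier.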



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1517) MIME type selection with probability

2015-01-28 Thread Luke sh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14295928#comment-14295928
 ] 

Luke sh commented on TIKA-1517:
---

The probability selection will inherit from the MimeTypes class, which needs 
to be modified to expose some of its methods so that they can be inherited 
and overridden; see TIKA-1535.

 MIME type selection with probability
 

 Key: TIKA-1517
 URL: https://issues.apache.org/jira/browse/TIKA-1517
 Project: Tika
  Issue Type: Improvement
  Components: mime
Affects Versions: 0.1-incubating, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 
 0.10, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6
Reporter: Luke sh
 Attachments: BaysianTest.java


 Improvement and intuition
 The original implementation of MIME type selection/detection is somewhat 
 inflexible by design, as it relies heavily on the outcome of magic-byte 
 MIME type identification: if magic bytes are applicable to a file, Tika 
 follows the file type they detect. It may be better to give users more 
 control over the method of choice.
 The proposed approach incorporates Bayes' theorem: users assign weights, in 
 terms of probability, to each of the file/MIME type identification methods 
 implemented in Tika, and there are currently three of them (magic bytes, 
 file extension, and the metadata content-type hint). By weighting the 
 methods, users can express which one they trust most. The magic-byte method 
 is usually the most trustworthy, but in some situations file type 
 identification must be strict: a user may want all of the identification 
 methods to agree on the same type before processing a file, because an 
 incorrect identification is intolerable there. The current implementation 
 is too inflexible for this purpose, as it relies heavily on the magic-byte 
 method (even though magic bytes are the most reliable of the three).
 Proposed design:
 The idea is to use probabilities as weights on each MIME type 
 identification method currently implemented in Tika (magic-byte match, file 
 extension match, and metadata content-type hint).
 For example, as a user I would assign a preference to each method according 
 to my degree of trust in it, and order the results when they do not 
 coincide. Bayes' rule fits this intuition. The following quantities are 
 needed for a Bayesian implementation:
  Prior probability P(file_type), e.g. P(pdf). In theory this is computed 
  from samples and depends on the domain or use case; intuitively we care 
  more about the order of the weighted results than about the actual 
  numbers. For instance, if we crawl a website that consists mostly of PDF 
  files, we can collect samples and find that 90% of the documents are PDFs, 
  giving the prior P(pdf) = 0.9. Here we propose to make the prior a 
  configurable parameter and, by default, to leave it unapplied. 
  Alternatively, the prior for each file type could be 1/[number of file 
  types supported by Tika], roughly 1/1157, which seems fairer; but this 
  prior is identical for every type, and since we mainly care about the 
  order of the results, a fixed prior cannot change that order. Bringing 
  1/1157 into the Bayesian equation would therefore only burden the 
  implementation with extra computation, so by default we treat the prior as 
  unapplied, i.e. we assign it the value 1, as if it were not there. This 
  parameter remains configurable, which provides flexibility for some use 
  cases.
  Conditional probability of a positive test given a file type, 
  P(test | file_type), e.g. P(test1 = pdf | pdf). This probability is also 
  based on samples and the domain or use case, and we leave it configurable. 
  Based on our intuition that test1 (the magic-byte method) is the most 
  trustworthy, the default value of P(test1 = a_file_type | a_file_type) is 
  0.75: given a file of type a_file_type, the probability that test1 
  predicts a_file_type is
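The claim that a fixed uniform prior cannot change the ordering can be checked with a small sketch (the likelihood numbers below are hypothetical, purely for illustration): multiplying every candidate's score by the same constant 1/1157 rescales all scores equally, so the winner is unchanged.

```java
public class PriorSketch {
    // Hypothetical unnormalized Bayesian scores for three candidate types;
    // the prior is either unapplied (1.0) or the uniform 1/1157 discussed above.
    static double[] scores(double prior) {
        double[] likelihood = { 0.75, 0.35, 0.30 };  // hypothetical P(tests | type)
        double[] s = new double[likelihood.length];
        for (int i = 0; i < s.length; i++) {
            s[i] = prior * likelihood[i];            // same prior for every type
        }
        return s;
    }

    // Index of the winning candidate.
    static int argmax(double[] s) {
        int best = 0;
        for (int i = 1; i < s.length; i++) {
            if (s[i] > s[best]) best = i;
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(argmax(scores(1.0)));         // winner with prior unapplied
        System.out.println(argmax(scores(1.0 / 1157)));  // same winner: order unchanged
    }
}
```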

[jira] [Commented] (TIKA-1535) Inheritance modification for the class MIMETypes

2015-01-28 Thread Luke sh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14295922#comment-14295922
 ] 

Luke sh commented on TIKA-1535:
---

For TIKA-1517, the probability-based MIME type selection mechanism will be 
implemented by inheriting from this class. MimeTypes is currently declared 
final, and some of its methods are private, which does not allow 
overriding.
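A minimal illustration of the blocking modifiers and the proposed relaxation (simplified stand-in classes, not the actual MimeTypes code):

```java
// Current shape: 'final' forbids subclassing, and the private helper
// could not be overridden even if subclassing were allowed.
final class SealedMimeTypes {
    private String applyMagic(byte[] content) {
        return "application/octet-stream";
    }
    public String detect(byte[] content) {
        return applyMagic(content);
    }
}

// Proposed shape: drop 'final' and widen 'private' to 'protected',
// so a subclass can plug in a different selection strategy.
class OpenMimeTypes {
    protected String applyMagic(byte[] content) {
        return "application/octet-stream";
    }
    public String detect(byte[] content) {
        return applyMagic(content);
    }
}

class ProbabilisticMimeTypes extends OpenMimeTypes {
    @Override
    protected String applyMagic(byte[] content) {
        // hypothetical: re-rank magic-byte candidates by probability
        return "application/pdf";
    }
}

public class InheritanceSketch {
    public static void main(String[] args) {
        OpenMimeTypes detector = new ProbabilisticMimeTypes();
        // Dynamic dispatch picks the overridden strategy.
        System.out.println(detector.detect(new byte[0]));  // application/pdf
    }
}
```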



 Inheritance modification for the class MIMETypes
 

 Key: TIKA-1535
 URL: https://issues.apache.org/jira/browse/TIKA-1535
 Project: Tika
  Issue Type: Improvement
  Components: mime
Reporter: Luke sh
Priority: Trivial




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (TIKA-1517) MIME type selection with probability

2015-01-28 Thread Luke sh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14295928#comment-14295928
 ] 

Luke sh edited comment on TIKA-1517 at 1/28/15 11:06 PM:
-

The probability selection implementation will inherit from the class 
MimeTypes, which needs to be modified to expose some of its methods so that 
they can be inherited and overridden; see TIKA-1535.


was (Author: lukeliush):
the probability selection will inherit the class MIMETypes, which needs to be 
modified by exposing some of its methods for being able to inherit and 
overwrite, TIKA-1535.

 MIME type selection with probability
 

 Key: TIKA-1517
 URL: https://issues.apache.org/jira/browse/TIKA-1517
 Project: Tika
  Issue Type: Improvement
  Components: mime
Affects Versions: 0.1-incubating, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 
 0.10, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6
Reporter: Luke sh
 Attachments: BaysianTest.java



[jira] [Commented] (TIKA-1521) Handle password protected 7zip files

2015-01-27 Thread Luke sh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14293944#comment-14293944
 ] 

Luke sh commented on TIKA-1521:
---

Hi Nick Burch,

I just came across this problem and had a quick look at it. I also ran a 
quick test with the artifacts attached to the ticket, and I think I was able 
to replicate the problem (i.e. 
https://builds.apache.org/job/tika-trunk-jdk1.6/425/org.apache.tika$tika-parsers/testReport/junit/org.apache.tika.parser.pkg/Seven7ParserTest/testPasswordProtected/
 ), but I don't see that this has anything to do with the JDK or platform 
version; the following is what I got.

I think this might have something to do with 
org.apache.commons.compress.archivers.sevenz.SevenZFile, or perhaps with the 
way commons-compress is invoked. My platform is Windows 8.1, and the test 
fails with both JDK 1.6 and JDK 1.8.


java.lang.AssertionError: text.txt not found in:

    at org.junit.Assert.fail(Assert.java:88)
    at org.junit.Assert.assertTrue(Assert.java:41)
    at org.apache.tika.TikaTest.assertContains(TikaTest.java:85)
    at org.apache.tika.parser.pkg.Seven7ParserTest.testPasswordProtected(Seven7ParserTest.java:195)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
    at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
    at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)



 Handle password protected 7zip files
 

 Key: TIKA-1521
 URL: https://issues.apache.org/jira/browse/TIKA-1521
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.7
Reporter: Nick Burch
 Fix For: 1.8


 While working on TIKA-1028, I noticed that although Commons Compress 
 doesn't currently handle decrypting password-protected zip files, it does 
 handle password-protected 7zip files.
 We should therefore add logic to the package parser to spot 
 password-protected 7zip files and fetch the password for them from a 
 PasswordProvider, if one is given.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1517) MIME type detection with probability

2015-01-17 Thread Luke sh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14281254#comment-14281254
 ] 

Luke sh commented on TIKA-1517:
---

Basic feature design:
The probability selection mechanism is incorporated only in the MimeTypes 
detector. There are currently four detectors implemented in Tika: 
org.gagravarr.tika.OggDetector, 
org.apache.tika.parser.microsoft.POIFSContainerDetector, 
org.apache.tika.parser.pkg.ZipContainerDetector, and lastly 
org.apache.tika.mime.MimeTypes.

Other than the MimeTypes detector, the other three detectors call out to 
other open APIs to detect MIME types; if detectors disagree on the detected 
MIME type, Tika will generally choose the type detected earlier in the 
preferential order above.

Inside the MimeTypes detector, the probability selection mechanism is 
invoked from the detect() method; the method is called 
applyProbabilities(List<MimeType> possibleTypes, MimeType extMimeType, 
MimeType metadataMimeType).
possibleTypes is a list of MIME types estimated by the magic-byte method. 
The magic-byte method may estimate more than one type, and the list is 
assumed to maintain an order of precedence, with the first element carrying 
the highest weight. applyProbabilities() calculates the posterior 
probability of each file type estimated by each detection method, keeps 
track of the one with the highest posterior probability, and eventually 
returns it as the result. It is also worth noting that one type may be a 
super type of another, in which case they belong to the same class of 
types; so at the beginning of the method there is an extra step in which 
types from the same class are reset to the most specific one. E.g. if type 
A is a super type of B (A and B belong to the same class of types but could 
be estimated by different methods), then A is reset to B. Then for each 
type we compute the posterior probability conditioned on the types detected 
by each method.

By default, the probability selection mechanism is disabled; to enable it, 
set the boolean useProbSelection to true (the default value is false).


To be continued
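A rough sketch of the selection step described above, using simplified stand-in types rather than Tika's actual signatures (the weights are the defaults proposed in this issue: 0.75 for magic bytes, 0.65 for file extension, 0.7 for the content-type hint):

```java
import java.util.Arrays;
import java.util.List;

public class ProbabilitySketch {
    // Default trust weights proposed in this issue:
    // test1 = magic bytes, test2 = file extension, test3 = content-type hint.
    static final double P_MAGIC = 0.75, P_EXT = 0.65, P_META = 0.70;
    static final double N_MAGIC = 0.25, N_EXT = 0.35, N_META = 0.30;

    // Unnormalized posterior of 'candidate' given what each test predicted;
    // the prior is left unapplied (treated as 1), as the issue proposes.
    static double posterior(String candidate, String magic, String ext, String meta) {
        double p = 1.0;
        p *= candidate.equals(magic) ? P_MAGIC : N_MAGIC;
        p *= candidate.equals(ext)   ? P_EXT   : N_EXT;
        p *= candidate.equals(meta)  ? P_META  : N_META;
        return p;
    }

    // Keep the candidate with the highest posterior; ties keep earlier
    // candidates, mirroring the precedence assumed for possibleTypes.
    static String applyProbabilities(List<String> possibleTypes, String ext, String meta) {
        String magic = possibleTypes.isEmpty() ? "" : possibleTypes.get(0);
        String best = null;
        double bestP = -1.0;
        for (String t : possibleTypes) {
            double p = posterior(t, magic, ext, meta);
            if (p > bestP) { bestP = p; best = t; }
        }
        for (String t : Arrays.asList(ext, meta)) {  // also score the other candidates
            double p = posterior(t, magic, ext, meta);
            if (p > bestP) { bestP = p; best = t; }
        }
        return best;
    }

    public static void main(String[] args) {
        // All three methods agree:
        System.out.println(applyProbabilities(
                Arrays.asList("application/pdf"), "application/pdf", "application/pdf"));
        // Magic bytes disagree with two agreeing weaker tests:
        System.out.println(applyProbabilities(
                Arrays.asList("image/png"), "image/jpeg", "image/jpeg"));
    }
}
```

Note how, under these default weights, two agreeing weaker signals (extension and content-type hint) can outweigh the magic-byte candidate.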





 MIME type detection with probability
 

 Key: TIKA-1517
 URL: https://issues.apache.org/jira/browse/TIKA-1517
 Project: Tika
  Issue Type: Improvement
  Components: mime
Affects Versions: 0.1-incubating, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 
 0.10, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6
Reporter: Luke sh
 Attachments: BaysianTest.java



[jira] [Updated] (TIKA-1517) MIME type detection with probability

2015-01-15 Thread Luke sh (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke sh updated TIKA-1517:
--
Description: 
Improvement and intuition
The original implementation of MIME type selection/detection is somewhat 
inflexible by design, as it relies heavily on the outcome of magic-byte MIME 
type identification: if magic bytes are applicable to a file, Tika follows 
the file type they detect. It may be better to give users more control over 
the method of choice.

The proposed approach incorporates Bayes' theorem: users assign weights, in 
terms of probability, to each of the file/MIME type identification methods 
implemented in Tika, and there are currently three of them (magic bytes, 
file extension, and the metadata content-type hint). By weighting the 
methods, users can express which one they trust most. The magic-byte method 
is usually the most trustworthy, but in some situations file type 
identification must be strict: a user may want all of the identification 
methods to agree on the same type before processing a file, because an 
incorrect identification is intolerable there. The current implementation is 
too inflexible for this purpose, as it relies heavily on the magic-byte 
method (even though magic bytes are the most reliable of the three).


Proposed design:
The idea is to use probabilities as weights on each MIME type 
identification method currently implemented in Tika (magic-byte match, file 
extension match, and metadata content-type hint).

For example, as a user I would assign a preference to each method according 
to my degree of trust in it, and order the results when they do not coincide.
Bayes' rule fits this intuition. The following quantities are needed for a 
Bayesian implementation:

 Prior probability P(file_type), e.g. P(pdf). In theory this is computed 
 from samples and depends on the domain or use case; intuitively we care 
 more about the order of the weighted results than about the actual numbers. 
 For instance, if we crawl a website that consists mostly of PDF files, we 
 can collect samples and find that 90% of the documents are PDFs, giving the 
 prior P(pdf) = 0.9. Here we propose to make the prior a configurable 
 parameter and, by default, to leave it unapplied. Alternatively, the prior 
 for each file type could be 1/[number of file types supported by Tika], 
 roughly 1/1157, which seems fairer; but this prior is identical for every 
 type, and since we mainly care about the order of the results, a fixed 
 prior cannot change that order. Bringing 1/1157 into the Bayesian equation 
 would therefore only burden the implementation with extra computation, so 
 by default we treat the prior as unapplied, i.e. we assign it the value 1, 
 as if it were not there. This parameter remains configurable, which 
 provides flexibility for some use cases.


 Conditional probability of a positive test given a file type, 
 P(test | file_type), e.g. P(test1 = pdf | pdf). This probability is also 
 based on samples and the domain or use case, and we leave it configurable. 
 Based on our intuition that test1 (the magic-byte method) is the most 
 trustworthy, the default value of P(test1 = a_file_type | a_file_type) is 
 0.75: given a file of type a_file_type, the probability that test1 
 predicts a_file_type is 0.75. Next we propose 0.7 for test3 and 0.65 for 
 test2.
(Note again: test1 = magic bytes, test2 = file extension, test3 = metadata 
content-type hint.)

 Conditional probabilities of negative tests also need to be defined 
 intuitively.
E.g. by default, given a file whose type is not pdf, the probability of 
test1 predicting pdf is 1 - P(test1 = pdf | pdf), thus 
P(test1 = pdf | ~pdf) = 1 - 0.75 = 0.25, as we trust test1 the most; the 
other tests are given 0.35 and 0.3 respectively, following the same 
intuition.


 The goal is to find 
P(file_type | test1 = file_type, test2 = file_type, test3 = file_type)

(Please note, we are mostly interested in
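As a worked illustration of these defaults (an assumption-laden sketch, not Tika code): the likelihood of the observed test outcomes under a candidate type is the product of the positive/negative terms above, and comparing those products across candidates gives the ranking.

```java
public class BayesSketch {
    // Default P(test_i = T | file is T) from this issue:
    static final double[] POS = { 0.75, 0.65, 0.70 };  // magic, extension, metadata
    // Default P(test_i = T | file is not T):
    static final double[] NEG = { 0.25, 0.35, 0.30 };

    // Likelihood of the observed outcomes if the file really is the candidate
    // type; hit[i] is true when test i predicted the candidate type.
    static double likelihood(boolean[] hit) {
        double p = 1.0;
        for (int i = 0; i < hit.length; i++) {
            p *= hit[i] ? POS[i] : NEG[i];
        }
        return p;
    }

    public static void main(String[] args) {
        double allAgree  = likelihood(new boolean[] { true, true, true });    // 0.341250
        double magicOnly = likelihood(new boolean[] { true, false, false });  // 0.078750
        System.out.println(allAgree > magicOnly);  // unanimity is the strongest evidence
    }
}
```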