[jira] [Commented] (TIKA-1582) Content-based Mime Detection with Byte-frequency-histogram
[ https://issues.apache.org/jira/browse/TIKA-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14537405#comment-14537405 ]

Luke sh commented on TIKA-1582:
---
Thanks, professor [~chrismattmann]. Hi all, I just created a draft wiki page for this feature; please review it and let me know if anything is unclear: https://wiki.apache.org/tika/ContentMimeDetection Thanks.

Content-based Mime Detection with Byte-frequency-histogram
---
Key: TIKA-1582
URL: https://issues.apache.org/jira/browse/TIKA-1582
Project: Tika
Issue Type: Improvement
Components: detector, mime
Affects Versions: 1.7
Reporter: Luke sh
Assignee: Chris A. Mattmann
Priority: Trivial
Labels: memex
Fix For: 1.9
Attachments: nnmodel.docx, week2-report-histogram comparison.docx, week6 report.docx

Content-based MIME type detection is one of the popular approaches to detecting MIME types; others are based on the file extension and on magic numbers. Tika currently implements three detection approaches: 1) file extensions, 2) magic numbers (the most trustworthy in Tika), and 3) the Content-Type header of the HTTP response, if present. Content-based detection, by contrast, analyses the distribution of the entire stream of bytes, finds a similar pattern across files of the same type, and builds a function that groups files into one or several classes for classification and prediction. This feature could broaden the usage of Tika and add some security to MIME type detection. Because the model is etched with the patterns it has seen, in some situations we may choose not to trust types that the model has not been trained on. Magic numbers embedded in a file can be copied, while the actual content could be a harmful Trojan program; by placing trust in byte frequency patterns instead, we can strengthen the detection.
The proposed content-based MIME detection to be integrated into Tika is based on a machine learning algorithm: a neural network trained with back-propagation. The input consists of 256 bins (0-255), one per byte value, each storing the count of occurrences of that byte. The byte frequency histograms are normalized into the range between 0 and 1 and then passed through a companding function to boost the infrequent bytes. The output of the neural network is a binary decision, 1 or 0. Note that the proposed feature will be demonstrated with the GRB file type as one example: we build a model that separates the GRB file type from non-GRB file types. Since the set of non-GRB files is huge and cannot be easily delimited, as many negative training examples as possible are needed to form the decision boundary for the non-GRB types. The neural network involves two stages of processing: training and classification. Training can be done in any programming language; in this work it is implemented in R, and the source can be found in my GitHub repository, https://github.com/LukeLiush/filetypeDetection. I am also going to post a document that describes the use of the program and the syntax/format of its input and output. After training, we export the model and import it into Tika; there, we create a TrainedModelDetector that reads one or more model files with their parameters, so it can detect the MIME types covered by those models. Details of the research and the usage of this proposed feature will be posted on my GitHub shortly. It is worth noting again that in this research we only built one model, for GRB, as an example to demonstrate this content-based MIME detection.
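The input pipeline described above (256 byte-count bins, normalization into [0, 1], then companding to boost infrequent bytes) can be sketched as follows. This is an illustrative sketch, not Tika code: the class name and the square-root companding exponent are assumptions, since the issue does not specify the exact companding function.

```java
import java.util.Arrays;

public class ByteFrequencyHistogram {

    // Count occurrences of each byte value 0-255 in the stream.
    static double[] histogram(byte[] data) {
        double[] bins = new double[256];
        for (byte b : data) {
            bins[b & 0xFF]++;
        }
        return bins;
    }

    // Normalize counts into [0, 1] by dividing by the largest bin.
    static double[] normalize(double[] bins) {
        double max = Arrays.stream(bins).max().orElse(0.0);
        double[] out = new double[256];
        if (max == 0.0) {
            return out;
        }
        for (int i = 0; i < 256; i++) {
            out[i] = bins[i] / max;
        }
        return out;
    }

    // Companding: a concave function boosts infrequent bytes relative to
    // frequent ones; x^0.5 is one plausible choice (an assumption here).
    static double[] compand(double[] normalized) {
        double[] out = new double[256];
        for (int i = 0; i < 256; i++) {
            out[i] = Math.pow(normalized[i], 0.5);
        }
        return out;
    }
}
```

The 256 companded values would then form the input layer of the neural network, whose single output unit yields the binary GRB / non-GRB decision.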
One of the challenges, again, is that the non-GRB file types cannot be clearly defined unless we feed the model example data for every existing file type in the world, which is utopian and unlikely; it is better for the set of classes/types to be given and defined in advance, to narrow the problem domain. Another challenge is the size of the training data: even if we know the types we want to classify, collecting enough training data to form a good model is one of the main factors of success. For our example model, GRB data were collected from ftp://hydro1.sci.gsfc.nasa.gov/data/. We found that the GRB data from that source all exhibit a similar pattern, so a simple neural network structure predicts well; even a linear logistic regression does a good job. However, if we pass GRB files collected from other sources to the model for prediction, we find that the
[jira] [Updated] (TIKA-1582) Content-based Mime Detection with Byte-frequency-histogram
[ https://issues.apache.org/jira/browse/TIKA-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Luke sh updated TIKA-1582:
--
Summary: Content-based Mime Detection with Byte-frequency-histogram (was: Mime Detection based on neural networks with Byte-frequency-histogram)
[jira] [Commented] (TIKA-1582) Mime Detection based on neural networks with Byte-frequency-histogram
[ https://issues.apache.org/jira/browse/TIKA-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14526717#comment-14526717 ]

Luke sh commented on TIKA-1582:
---
Thanks a lot [~talli...@apache.org] for the comments; here are my thoughts.

[Tim]: I'm not sure how to use it, or even how to know when I'd want to.

[Luke]: The idea behind this feature is to give Tika users the option of applying content-based MIME detection. The particular algorithm does not matter much (neural network, SVM, Bayesian, etc.); there is no single best machine learning algorithm for every data problem. Users could even apply a simple linear classification technique to classify their file types, as long as it meets their goals; this requires a bit of empirical analysis of the learning algorithms. What matters more, in my opinion, is the data used for classification. The patterns, i.e. the knowledge that comes from the data, may be specialized to one domain, and understanding them requires some domain knowledge, which may be the key to developing a high-accuracy learning system. Alternatively, I might ask myself: could a human expert classify the file types by looking at the input X (e.g. a histogram, or the actual bytes)? Over every existing type in the world, I don't think a human could learn to do that accurately; but over a well-defined set of one, two, or several types, the detection accuracy could be much higher. When users want more security or assurance in detecting some particular file types, they can define or develop their own learning approach; they can use SVM, Bayesian methods, neural nets, etc. (whatever they want) to further undergird detection security, as long as they have trained a good model. From this perspective, users might also need some knowledge of the machine learning algorithm they want to use.
I have not taken a closer look at the links, e.g. http://www.dfrws.org/2012/proceedings/DFRWS2012-5.pdf, but I guess the tests there are based on some set of file types, and the accuracy may not be 100% for each type tested. In essence, machine learning algorithms are good at estimation from existing knowledge; if we apply our existing knowledge to the detection, we probably can enhance detection security. If anything is unclear, please let me know; any comments are welcome and appreciated. Thanks.
[jira] [Commented] (TIKA-1582) Mime Detection based on neural networks with Byte-frequency-histogram
[ https://issues.apache.org/jira/browse/TIKA-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14525379#comment-14525379 ]

Luke sh commented on TIKA-1582:
---
Sure, [~chrismattmann], I will work on the wiki for this content detection feature, TIKA-1582. Thanks [~gagravarr] for the comments, but https://wiki.apache.org/tika/BaysianMimeTypeSelector is a separate feature that corresponds to TIKA-1517. Thanks.
[jira] [Commented] (TIKA-1517) MIME type selection with probability
[ https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512284#comment-14512284 ]

Luke sh commented on TIKA-1517:
---
Notes: a pull request adds support for the Tika() facade. If a user wants to use this feature, the following code is needed:

    tika = new Tika(new TikaConfig() {
        @Override
        protected Detector getDefaultDetector(MimeTypes types, ServiceLoader loader) {
            /*
             * here is an example with the use of the builder to
             * instantiate the object.
             */
            Builder builder = new ProbabilisticMimeDetectionSelector.Builder();
            ProbabilisticMimeDetectionSelector proDetector =
                    new ProbabilisticMimeDetectionSelector(types,
                            builder.priorMagicFileType(0.5f)
                                   .priorExtensionFileType(0.5f)
                                   .priorMetaFileType(0.5f));
            return new DefaultProbDetector(proDetector, loader);
        }
    });

The idea is simple: we override getDefaultDetector() to return a DefaultProbDetector, which extends CompositeDetector. A CompositeDetector (whose supertype is Detector) takes a list of detectors, and when its detect() method is called, each detector in the list is invoked sequentially, one after another. The original implementation of getDefaultDetector() in TikaConfig returns an instance of DefaultDetector, which also extends CompositeDetector; its list of detectors includes MimeTypes, the native implementation of the three detection methods (magic bytes, extension, and metadata hint). DefaultProbDetector replaces this MimeTypes with a ProbabilisticMimeDetectionSelector. To set the preferential weights, an instance of ProbabilisticMimeDetectionSelector can be created as in the snippet above. Alternatively, to keep the default settings of ProbabilisticMimeDetectionSelector, it is fine to omit the arguments and simply call "return new DefaultProbDetector();".
Alternatively, if we don't want to write extra code, the following can also be used:

    /*
     * An XML file tells Tika where the detector is located; users can build
     * their own configuration, including or excluding this feature or any
     * detectors in the composite list at will.
     */
    Tika tika = new Tika(new TikaConfig(new File("TIKA-detector-sample.xml")));

TIKA-detector-sample.xml:

    <?xml version="1.0" encoding="UTF-8"?>
    <!--
      Licensed to the Apache Software Foundation (ASF) under one or more
      contributor license agreements. See the NOTICE file distributed with
      this work for additional information regarding copyright ownership.
      The ASF licenses this file to You under the Apache License, Version 2.0
      (the "License"); you may not use this file except in compliance with
      the License. You may obtain a copy of the License at

          http://www.apache.org/licenses/LICENSE-2.0

      Unless required by applicable law or agreed to in writing, software
      distributed under the License is distributed on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
      See the License for the specific language governing permissions and
      limitations under the License.
    -->
    <properties>
      <detectors>
        <detector class="org.apache.tika.detect.DefaultProbDetector"/>
      </detectors>
    </properties>

MIME type selection with probability
Key: TIKA-1517
URL: https://issues.apache.org/jira/browse/TIKA-1517
Project: Tika
Issue Type: Improvement
Components: mime
Affects Versions: 0.1-incubating, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.10, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6
Reporter: Luke sh
Priority: Trivial
Attachments: BaysianTest.java

Improvement and intuition: The original implementation of MIME type selection/detection is somewhat inflexible by design, as it relies heavily on the outcome of magic-bytes MIME type identification; e.g. if magic bytes are applicable to a file, Tika follows the file type detected by the magic bytes.
It may be better to provide more control over the method of choice. The proposed approach incorporates the Bayesian probability theorem: users can assign weights, in terms of probability, to each approach, giving them control over their preference among the file/MIME type identification methods available in Tika (currently three: magic bytes, file extension, and the metadata Content-Type hint). By introducing weights on the approaches, users can choose which method they trust most; the magic-bytes method is often trustworthy, though. But the virtue is that in some situations file type identification must be sensitive: some might want all of the MIME type identification methods to agree on the same file type before they start processing the files, because incorrect file type identification is intolerable.
[jira] [Commented] (TIKA-1517) MIME type selection with probability
[ https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512286#comment-14512286 ]

Luke sh commented on TIKA-1517:
---
Notes: this feature was also tested with the data from gen-common-crawl.sh (https://github.com/LukeLiush/trec-dd-polar); with default settings it behaves as expected on that data. The following is copied from the email update about the test. Both runs (Tika with the probabilistic feature and the one without it) produced the same stats totals; please see the attached matched.txt, dumped by a small program that verbatim checks and compares each line in every section of the stats totals between the log produced by the Tika build with the feature and the one without it. If string.equals(...) is satisfied, the line's string is dumped out; if there is a mismatch (e.g. the count for a particular MIME type differs), an error is dumped out. In the end, I don't see any error in the printout, so the feature seems to have passed the test. The processing times of the two tests were as follows. The test with the Nutch dumper tool integrated with the probabilistic selection feature ran from 2015-04-22 15:47:08,330 to 2015-04-22 17:48:28,877. The test without the probabilistic selection feature ran
from 2015-04-22 22:41:23,459 to 2015-04-23 00:11:02,767.
The current implementation seems less flexible for this purpose and relies heavily on the magic-bytes identification method (although magic bytes are the most reliable of the three).

Proposed design: The idea is to incorporate probabilities as weights on each MIME type identification method currently implemented in Tika (the magic-bytes approach, file extension match, and metadata Content-Type hint). For example, as a user, I would like to assign preferences to the methods based on my degree of trust in them, and order the results when they do not coincide. Bayes' rule is appropriate here to capture this intuition. The following is needed for a Bayesian implementation: the prior probability P(file_type), e.g. P(pdf). Theoretically this is computed from samples, and it depends on the domain or use case; intuitively we care more about the ordering of the weights/probabilities of the results than about the actual numbers. The context of the prior also depends on the samples for a particular use case or domain: e.g. if we happen to crawl a website that contains mostly PDF files, we can collect some samples and compute the prior from them, and if 90% of the documents are PDF, the prior is defined to be P(pdf) = 0.9. Here, however, we propose to make the prior a configurable parameter for users, left unapplied by default. Alternatively, we can define the prior for each file type as 1/[number of supported file types in Tika]; I think the number would be approximately 1/1157, and using this number seems to be
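To make the weighting idea concrete, here is a minimal sketch of combining per-method trust weights into a single choice. This is not the actual ProbabilisticMimeDetectionSelector implementation; the class name, method names, and the 0.75/0.5/0.3 weights are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class WeightedMimeVote {

    // Each identification method votes for the type it detected, weighted by
    // how much we trust that method; the type with the highest total wins.
    static String select(Map<String, String> predictions,
                         Map<String, Double> trust) {
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, String> e : predictions.entrySet()) {
            double weight = trust.getOrDefault(e.getKey(), 0.0);
            scores.merge(e.getValue(), weight, Double::sum);
        }
        return scores.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse("application/octet-stream");
    }
}
```

With trust weights such as {magic: 0.75, extension: 0.5, metadata: 0.3}, magic and metadata agreeing on application/pdf (total 1.05) outrank an extension vote for text/html (0.5), yet two weaker methods agreeing can still jointly overrule magic — matching the intuition of ordering results by preference rather than always following magic bytes.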
[jira] [Commented] (TIKA-1610) CBOR Parser and detection [improvement]
[ https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510402#comment-14510402 ]

Luke sh commented on TIKA-1610:
---
Thanks a lot [~gagravarr] for the prompt response. I thought it would probably be risky to discard either of the estimated types because of the magic priority (one being higher than the other); I wanted Tika to rely on the extension when there is a tie to break. For now, in this particular case, I also cannot think of any reason not to use 60; maybe I am too skeptical. Thanks.

CBOR Parser and detection [improvement]
---
Key: TIKA-1610
URL: https://issues.apache.org/jira/browse/TIKA-1610
Project: Tika
Issue Type: New Feature
Components: detector, mime, parser
Affects Versions: 1.7
Reporter: Luke sh
Assignee: Chris A. Mattmann
Priority: Trivial
Labels: memex
Attachments: 142440269.html, NUTCH-1997.cbor, cbor_tika.mimetypes.xml.jpg, rfc_cbor.jpg

CBOR is a data format whose design goals include the possibility of extremely small code size, fairly small message size, and extensibility without the need for version negotiation (cited from http://cbor.io/). It would be great if Tika could provide CBOR parsing and identification. In the current project with Nutch, the Nutch CommonCrawlDataDumper is used to dump the crawled segments to files in CBOR format. To read/parse the files dumped by this tool, it would be great if Tika could parse CBOR. The problem is that CommonCrawlDataDumper does not dump with the correct extension; it dumps by its own rule, with a default extension of .html, so it would be less painful if Tika could detect and parse those files without any pre-processing steps. CommonCrawlDataDumper calls the following to dump CBOR:
    import com.fasterxml.jackson.dataformat.cbor.CBORFactory;
    import com.fasterxml.jackson.dataformat.cbor.CBORGenerator;

FasterXML is a third-party library for converting JSON to .cbor and vice versa. According to RFC 7049 (http://tools.ietf.org/html/rfc7049), CBOR does not yet have magic numbers by which other applications can detect/identify it (PFA: rfc_cbor.jpg). It seems that the only ways to inform other applications of the type, as of now, are the extension (i.e. .cbor) or content-based detection (i.e. byte histogram distribution estimation). Another thing worth attention: Tika has attempted to add support for CBOR MIME detection in tika-mimetypes.xml (PFA: cbor_tika.mimetypes.xml.jpg), but this detection does not work with the CBOR files dumped by CommonCrawlDataDumper. According to http://tools.ietf.org/html/rfc7049#section-2.4.5, there is a self-describing tag, 55799, that can be used for CBOR type identification (its hex encoding is 0xd9d9f7), but it is up to the application to emit this tag, and it is possible that the FasterXML writer used by the Nutch dumping tool omits it; an example CBOR file dumped by the Nutch CommonCrawlDataDumper has been attached (PFA: 142440269.html). The following is cited from the RFC: "...a decoder might be able to parse both CBOR and JSON. Such a decoder would need to mechanically distinguish the two formats. An easy way for an encoder to help the decoder would be to tag the entire CBOR item with tag 55799, the serialization of which will never be found at the beginning of a JSON text..." It also looks like a file can have two parts/sections, i.e. a plain text part and the JSON prettified by CBOR; this is worth attention and consideration in parsing and type identification as well.
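A check for the self-describing tag would only need to look at the first three bytes: tag 55799 serializes as 0xD9 0xD9 0xF7 (major type 6 with a two-byte tag argument). The sketch below is illustrative and not Tika's detector API:

```java
public class CborSelfDescribeCheck {

    // RFC 7049 section 2.4.5: self-described CBOR starts with tag 55799,
    // whose serialization is the three-byte prefix 0xD9 0xD9 0xF7.
    static boolean hasSelfDescribingTag(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xD9
                && (data[1] & 0xFF) == 0xD9
                && (data[2] & 0xFF) == 0xF7;
    }
}
```

If the Nutch dumps indeed lack this tag (as the attachment suggests), such a check would fail on them, which may explain why a magic-based rule in tika-mimetypes.xml cannot identify them and why extension globs or content-based detection remain the fallback.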
On the other hand, it is worth noting that an entry for cbor extension detection needs to be appended to tika-mimetypes.xml too, e.g. <glob pattern="*.cbor"/> -- This message was sent by Atlassian JIRA (v6.3.4#6332)
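The self-describing tag 55799 mentioned above serializes to the three bytes 0xd9 0xd9 0xf7, so a stream that carries the tag can be recognized with a trivial prefix check. Here is a minimal sketch (not Tika code; it assumes the encoder actually emitted the tag, which the dumped files discussed here apparently lack):

```python
# Prefix check for CBOR's self-describing tag 55799 (RFC 7049 section 2.4.5).
# Its serialization is 0xd9 0xd9 0xf7; per the RFC, this byte sequence can
# never begin a valid JSON text, so it cleanly distinguishes the two formats.

SELF_DESCRIBE_TAG = b"\xd9\xd9\xf7"

def starts_with_cbor_tag(data: bytes) -> bool:
    """Return True if the byte stream opens with tag 55799."""
    return data[:3] == SELF_DESCRIBE_TAG

print(starts_with_cbor_tag(b"\xd9\xd9\xf7\xa1\x61a\x01"))  # tagged CBOR item -> True
print(starts_with_cbor_tag(b'{"json": true}'))             # JSON text -> False
```

Of course, this only helps when the encoder cooperates; files written without the tag (as the CommonCrawlDataDumper output appears to be) still need extension-based or statistical content detection.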
[jira] [Updated] (TIKA-1610) CBOR Parser and detection [improvement]
[ https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke sh updated TIKA-1610: -- Attachment: NUTCH-1997.cbor
[jira] [Comment Edited] (TIKA-1610) CBOR Parser and detection [improvement]
[ https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14510382#comment-14510382 ] Luke sh edited comment on TIKA-1610 at 4/24/15 2:43 AM: Notes: The attached cbor file (i.e. NUTCH-1997.cbor) contains magic bytes for both type xhtml and type cbor. With priority 40 on application/cbor, we have the following issues.

Problem 1: magic priority 40. application/xhtml+xml has a higher priority (50) than application/cbor (40). (I don't know who assigned 40 to cbor, or why.) So if xhtml is read and compared first, cbor will not even be placed in the magic estimation list because of its lower priority. Based on the tests, xhtml does indeed get read and compared first against the input file, so any type below priority 50 is disregarded.

Problem 2: again, magic priority, now at 50. In Tika, given a file dumped by the Nutch dumper tool, both types (xhtml and cbor) are selected as candidate mime types and placed in the magic estimation list; since the xhtml type is read first, it sits atop the cbor one. To break that tie, Tika relies on the decision from the extension method. If the extension method fails to detect the type (for now, let's ignore the metadata-hint method for simplicity, but the same applies to it too), then xhtml is returned eventually.

My pull request to be sent: I am going to set the magic priority of the cbor type to 50, the same as xhtml, because it would probably be risky to discard any of the estimated types without consulting the extension method.
[jira] [Commented] (TIKA-1610) CBOR Parser and detection [improvement]
[ https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14510382#comment-14510382 ] Luke sh commented on TIKA-1610: --- Notes: The attached cbor file contains magic bytes for both type xhtml and type cbor. With priority 40 on application/cbor, we have the following issues.

Problem 1: magic priority 40. application/xhtml+xml has a higher priority (50) than application/cbor (40). (I don't know who assigned 40 to cbor, or why.) So if xhtml is read and compared first, cbor will not even be placed in the magic estimation list because of its lower priority. Based on the tests, xhtml does indeed get read and compared first against the input file, so any type below priority 50 is disregarded.

Problem 2: again, magic priority, now at 50. In Tika, given a file dumped by the Nutch dumper tool, both types (xhtml and cbor) are selected as candidate mime types and placed in the magic estimation list; since the xhtml type is read first, it sits atop the cbor one. To break that tie, Tika relies on the decision from the extension method. If the extension method fails to detect the type (for now, let's ignore the metadata-hint method for simplicity, but the same applies to it too), then xhtml is returned eventually.

My pull request to be sent: I am going to set the magic priority of the cbor type to 50, the same as xhtml, because it would probably be risky to discard any of the estimated types without consulting the extension method.
[jira] [Updated] (TIKA-1610) CBOR Parser and detection improvement
[ https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke sh updated TIKA-1610: -- Attachment: cbor_tika.mimetypes.xml.jpg rfc_cbor.jpg
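The two priority problems described in the comments on this issue can be illustrated with a simplified model (an assumption for illustration, not Tika's actual detector code) of magic-priority candidate selection with an extension-based tie-break:

```python
# Simplified sketch of the tie-break behaviour described in this thread:
# magic candidates are kept only if they share the highest priority, and a
# matching extension-derived type breaks a tie among the survivors.

def detect(magic_candidates, extension_type=None):
    """magic_candidates: list of (mime_type, priority) in evaluation order."""
    if not magic_candidates:
        return extension_type or "application/octet-stream"
    top = max(p for _, p in magic_candidates)
    tied = [t for t, p in magic_candidates if p == top]
    if len(tied) == 1:
        return tied[0]
    # Tie: fall back to the extension-based type if it is one of the candidates.
    if extension_type in tied:
        return extension_type
    return tied[0]  # otherwise the first-read candidate (e.g. xhtml) wins

# With cbor at priority 40, it never survives against xhtml at 50:
print(detect([("application/xhtml+xml", 50), ("application/cbor", 40)]))
# With equal priorities, the .cbor extension can break the tie:
print(detect([("application/xhtml+xml", 50), ("application/cbor", 50)],
             extension_type="application/cbor"))
```

This is why raising application/cbor to priority 50 matters: at 40 the cbor candidate is discarded before the extension method is ever consulted.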
[jira] [Updated] (TIKA-1610) CBOR Parser and detection improvement
[ https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke sh updated TIKA-1610: -- Description:
[jira] [Created] (TIKA-1610) CBOR Parser and detection improvement
Luke sh created TIKA-1610: - Summary: CBOR Parser and detection improvement Key: TIKA-1610 URL: https://issues.apache.org/jira/browse/TIKA-1610 Project: Tika Issue Type: New Feature Components: detector, mime, parser Affects Versions: 1.7 Reporter: Luke sh Priority: Trivial
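Putting the pieces together, the tika-mimetypes.xml additions requested in this issue might look roughly like the following. This is a hypothetical entry, not the actual patch; the magic match assumes the self-describing tag 55799 (bytes 0xd9 0xd9 0xf7) is present, which the Nutch-dumped files apparently lack:

```xml
<mime-type type="application/cbor">
  <!-- Hypothetical: matches only files whose encoder emitted tag 55799. -->
  <magic priority="50">
    <match value="\xd9\xd9\xf7" type="string" offset="0"/>
  </magic>
  <glob pattern="*.cbor"/>
</mime-type>
```

The glob covers the extension-based path, while the magic entry at priority 50 keeps cbor from being discarded in favour of xhtml, per the comments above.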
[jira] [Updated] (TIKA-1610) CBOR Parser and detection improvement
[ https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke sh updated TIKA-1610: -- Description:
[jira] [Updated] (TIKA-1610) CBOR Parser and detection improvement
[ https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke sh updated TIKA-1610: -- Description: CBOR is a data format whose design goals include the possibility of extremely small code size, fairly small message size, and extensibility without the need for version negotiation (cited from http://cbor.io/ ). It would be great if Tika could provide CBOR parsing and identification support. In the current project with Nutch, the Nutch CommonCrawlDataDumper is used to dump crawled segments to files in CBOR format. To read/parse those dumped files, it would be great if Tika supported parsing CBOR. The trouble is that CommonCrawlDataDumper does not dump with a correct extension; by its own rule, the default extension of a dumped file is .html, so it would be much less painful if Tika could detect and parse these files without any pre-processing step. CommonCrawlDataDumper dumps CBOR using the following: import com.fasterxml.jackson.dataformat.cbor.CBORFactory; import com.fasterxml.jackson.dataformat.cbor.CBORGenerator; fasterxml (Jackson) is a third-party library for converting JSON to CBOR and vice versa. According to RFC 7049 (http://tools.ietf.org/html/rfc7049), it looks like CBOR does not yet have a magic number that other applications can use to detect/identify it (PFA: rfc_cbor.jpg). It seems the only ways to inform other applications of the type as of now are the file extension (i.e. .cbor) or content-based detection (e.g. byte-histogram distribution estimation). There is one more thing worth attention: it looks like Tika has already attempted to add CBOR mime detection support in tika-mimetypes.xml (PFA: cbor_tika.mimetypes.xml.jpg), but that detection does not work on the CBOR files dumped by CommonCrawlDataDumper.
According to http://tools.ietf.org/html/rfc7049, there is a self-describing tag, 55799, that can be used for CBOR type identification, but it is up to the encoding application to emit this tag, and it is possible that fasterxml omits it. An example CBOR file dumped by the Nutch tool, i.e. CommonCrawlDataDumper, has also been attached (PFA: 142440269.html). On the other hand, it is worth noting that an entry for .cbor extension detection needs to be appended to tika-mimetypes.xml too, e.g. <glob pattern="*.cbor"/> was: (previous revision of this description) CBOR Parser and detection improvement - Key: TIKA-1610 URL: https://issues.apache.org/jira/browse/TIKA-1610 Project: Tika Issue Type: New Feature Components:
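The self-describe tag mentioned above can be checked for directly. Below is a minimal sketch in plain Java (not Tika's actual detector; the class name is made up for illustration) that tests whether a byte sequence begins with tag 55799, which RFC 7049 serializes as the three bytes 0xD9 0xD9 0xF7. Note that files written by encoders that omit the tag, such as the dumps discussed here, will not match:

```java
/**
 * Illustrative sketch, not a Tika API: checks whether data starts with
 * the CBOR self-describe tag 55799 (RFC 7049, Section 2.4.5), whose
 * serialization is the three bytes 0xD9 0xD9 0xF7.
 */
public class CborTagSniffer {
    public static boolean looksLikeTaggedCbor(byte[] head) {
        // Mask with 0xFF because Java bytes are signed.
        return head.length >= 3
                && (head[0] & 0xFF) == 0xD9
                && (head[1] & 0xFF) == 0xD9
                && (head[2] & 0xFF) == 0xF7;
    }
}
```

A detector built only on this check would still miss untagged CBOR, which is exactly the problem with the CommonCrawlDataDumper output.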
[jira] [Updated] (TIKA-1610) CBOR Parser and detection improvement
[ https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke sh updated TIKA-1610: -- Attachment: 142440269.html (a CBOR file dumped by the Nutch tool) CBOR Parser and detection improvement - Key: TIKA-1610 URL: https://issues.apache.org/jira/browse/TIKA-1610 Project: Tika Issue Type: New Feature Components: detector, mime, parser Affects Versions: 1.7 Reporter: Luke sh Priority: Trivial Labels: memex Attachments: 142440269.html, cbor_tika.mimetypes.xml.jpg, rfc_cbor.jpg (Issue description as above.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1610) CBOR Parser and detection [improvement]
[ https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke sh updated TIKA-1610: -- Summary: CBOR Parser and detection [improvement] (was: CBOR Parser and detection improvement) CBOR Parser and detection [improvement] --- Key: TIKA-1610 URL: https://issues.apache.org/jira/browse/TIKA-1610 Project: Tika Issue Type: New Feature Components: detector, mime, parser Affects Versions: 1.7 Reporter: Luke sh Priority: Trivial Labels: memex Attachments: 142440269.html, cbor_tika.mimetypes.xml.jpg, rfc_cbor.jpg (Issue description as above, with the following additions:) According to http://tools.ietf.org/html/rfc7049#section-2.4.5, there is a self-describing tag, 55799, that can be used for CBOR type identification (its hex encoding might be 0xd9 0xd9 0xf7), but it is up to the encoding application to emit this tag, and it is possible that fasterxml, which the Nutch dumping tool uses, omits it. An example CBOR file dumped by the Nutch tool, i.e. CommonCrawlDataDumper, has also been attached (PFA: 142440269.html). The following is cited from the RFC: "...a decoder might be able to parse both CBOR and JSON. Such a decoder would need to mechanically distinguish the two formats. An easy way for an encoder to help the decoder would be to tag the entire CBOR item with tag 55799, the serialization of which will never be found at the beginning of a JSON text..." It also looks like a dumped file can contain two parts/sections, i.e. a plain-text part and the JSON serialized as CBOR; this is also worth attention and consideration in parsing and type identification. On the other hand, it is worth noting that an entry for .cbor extension detection needs to be appended to tika-mimetypes.xml too, e.g. <glob pattern="*.cbor"/>
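If the encoder did emit the self-describe tag, the existing magic-based detection could key on it. A hypothetical tika-mimetypes.xml entry along those lines, combining a magic match on the 0xd9 0xd9 0xf7 prefix with the *.cbor glob mentioned above, might look like the following (the exact attribute syntax should be checked against the conventions already used in tika-mimetypes.xml):

```xml
<!-- Hypothetical entry, shared-mime-info style magic matching as used
     in tika-mimetypes.xml; only tagged CBOR streams would match. -->
<mime-type type="application/cbor">
  <magic priority="50">
    <!-- CBOR self-describe tag 55799 = bytes D9 D9 F7 at offset 0 -->
    <match value="\xd9\xd9\xf7" type="string" offset="0"/>
  </magic>
  <glob pattern="*.cbor"/>
</mime-type>
```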
[jira] [Comment Edited] (TIKA-1517) MIME type selection with probability
[ https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391476#comment-14391476 ] Luke sh edited comment on TIKA-1517 at 4/2/15 1:30 AM: --- After some research, it looks like the probabilistic mime-type selection algorithm causes some confusion when compared against Naive Bayes. The idea is borrowed from Naive Bayes, but it turns out that giving up some of Naive Bayes' properties, while saving a little computation, makes the design less intuitive. One of the problems I have been considering: when the magic test fails to determine a type, it returns byte-stream, and the original design treats that byte-stream result as a decision. So when, e.g., the extension test returns a correct type (e.g. GRB) but the magic test returns byte-stream, meaning it failed to detect the type, the question is whether we should count byte-stream among the decisions. After thinking about this for some time this week, I have decided to ignore a byte-stream prediction when taking the vote on which type should be used. e.g. magic test: Byte-Stream (failed to detect the type); extension test: GRB; meta hint: none (Byte-Stream). In this case, the original design is expected to return Byte-Stream as the final type decision when the magic test's trust values (i.e. the presumed conditional probabilities) are set high. Secondly, after thinking for a while, I tend to think that giving up the prior is not a good idea: it simplifies the computation a bit, but it causes confusion and is less intuitive. The intuition is that we can treat a detected type as a cause with a prior of 50% correctness, i.e. a 50% chance that the detected type is correct; this seems more intuitive than ignoring the prior completely.
After thinking about this problem for a while, there still seems to be room for the algorithm to be improved and optimized. The original design is intertwined with computational considerations, which causes some confusion, whereas the causal reasoning and intuition are arguably more important; I will also be optimizing and correcting some factors that seem less appropriate in the original design. I am working on the improvement and researching the pros and cons; if you have any thoughts, please kindly let me know. was (Author: lukeliush): (previous revision of this comment) MIME type selection with probability Key:
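The prior-and-trust idea above can be made concrete with a small sketch in plain Java (the class and method names are made up for illustration; this is not Tika's actual probabilistic detector). It combines each detector's presumed trust value with a 50% prior via Bayes' rule, and skips the byte-stream fallback when voting:

```java
import java.util.Map;

/**
 * Illustrative sketch, not Tika's implementation: combine per-detector
 * trust values with a prior via Bayes' rule, ignoring the "byte-stream"
 * fallback that a detector returns when it fails.
 */
public class TypeVote {
    /** P(type correct | detector reports it), from prior p and trust t:
     *  p*t / (p*t + (1-p)*(1-t)). With p = 0.5 this reduces to t. */
    static double posterior(double prior, double trust) {
        return prior * trust / (prior * trust + (1 - prior) * (1 - trust));
    }

    /** Pick the reported type with the highest posterior; a detector that
     *  reported "byte-stream" (i.e. failed) gets no vote. */
    static String vote(Map<String, Double> detections) {
        String best = "byte-stream";
        double bestScore = 0.0;
        for (Map.Entry<String, Double> e : detections.entrySet()) {
            if (e.getKey().equals("byte-stream")) continue; // failed detector
            double score = posterior(0.5, e.getValue());
            if (score > bestScore) { bestScore = score; best = e.getKey(); }
        }
        return best;
    }
}
```

With the example above (magic test fails with a high trust value, extension test says GRB), the vote goes to GRB instead of Byte-Stream.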
[jira] [Commented] (TIKA-1517) MIME type selection with probability
[ https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391476#comment-14391476 ] Luke sh commented on TIKA-1517: --- (Initial revision of the comment above, since edited.) MIME type selection with probability Key: TIKA-1517 URL: https://issues.apache.org/jira/browse/TIKA-1517 Project: Tika Issue Type: Improvement Components: mime Affects Versions: 0.1-incubating, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.10, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6 Reporter: Luke sh Attachments: BaysianTest.java Improvement and intuition: The original implementation of MIME type selection/detection is somewhat inflexible by design, as it relies heavily on the outcome of magic-byte identification; thus, if magic bytes are applicable to a file, Tika follows the type they detect. It would be better to give users more control over the method of choice. The proposed approach incorporates the Bayesian probability theorem: users assign weights, in terms of probabilities, to each detection method, so they have control over their preference among the identification methods available in Tika; currently there are three (magic bytes, file extension, and the metadata content-type hint). By introducing these weights, users can choose which method they trust most, though the magic-byte method is usually the most trustworthy. The virtue is that in some situations file-type identification must be handled conservatively: some users may want all of the identification methods to agree on the same file type before they start processing files, because incorrect identification is intolerable.
The current implementation is less flexible for this purpose and relies heavily on the magic-byte identification method (even though magic bytes are the most reliable of the three). Proposed design: incorporate probabilities as weights on each MIME type identification method currently implemented in Tika (magic bytes, file-extension match, and metadata content-type hint). For example, as a user, I would assign preference to each method according to my degree of trust in it, and rank the results when they do not coincide. Bayes' rule seems appropriate here to meet that intuition. The following is what is needed for a Bayesian implementation: Prior probability
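The "all methods must agree before processing" policy described above could be sketched as follows (plain Java; a hypothetical helper, not a Tika API):

```java
import java.util.List;

/**
 * Illustrative sketch of the conservative policy described above:
 * accept a type only when every identification method reports the
 * same one; otherwise fall back to an unknown/byte-stream type.
 */
public class StrictAgreement {
    static String agreeOrFallback(List<String> detections, String fallback) {
        String first = detections.get(0);
        for (String d : detections) {
            if (!d.equals(first)) {
                return fallback; // the methods disagree: refuse to decide
            }
        }
        return first;
    }
}
```

This is the opposite extreme from weighted voting: rather than ranking disagreeing results by trust, it simply declines to classify when the methods conflict.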
[jira] [Comment Edited] (TIKA-1517) MIME type selection with probability
[ https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391476#comment-14391476 ] Luke sh edited comment on TIKA-1517 at 4/1/15 10:39 PM: After some research, it looks like the probabilistic MIME type selection design causes some confusion when compared against Naive Bayes. The idea is borrowed from Naive Bayes, but giving up some of its properties, while saving a little computation, makes the design less intuitive. One problem I have been considering: when the magic test fails to determine a type, it returns the octet-stream fallback, and the original design treats that fallback as a decision. So when the extension test returns the correct type (e.g. GRB) but the magic test returns octet-stream, meaning it failed to detect the type, the question is whether octet-stream should take part in the decision. After thinking about this for some time, I have decided to ignore an octet-stream result when taking the vote on which type should be used. For example: magic test: octet-stream (failed to detect the type); extension test: GRB; meta hint: none (octet-stream). The original design is expected to return octet-stream as the final decision whenever the magic test's trust values (i.e. the presumed conditional probabilities) are set high. Secondly, I now think that giving up on the prior is not a good idea: it simplifies the computation a little, but it makes the design confusing and less intuitive. The intuition is that we can treat a detected type as a cause with a prior of 50% correctness, i.e. a 50% chance that the detected type is correct; this seems more intuitive than ignoring the prior completely. After thinking about this problem for a while, there is still room for correction and optimization. The original design is intertwined with computational considerations, which causes some confusion, whereas the causal reasoning and intuition matter more; I will also be optimizing and correcting some of the factors that seem less appropriate in the original design. I am working on the improvement; if you have any thoughts, please kindly let me know.
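The voting rule described in the comment above, dropping a detector's vote when it falls back to the untyped result, can be sketched as follows. This is a minimal illustration, not Tika's actual implementation; the detector names, the trust weights, and the use of application/x-grib as a stand-in name for the GRB type are assumptions:

```python
# Sketch: pick a MIME type by trust-weighted vote, ignoring the
# octet-stream fallback, which means "could not detect" rather than
# being a real prediction.
OCTET_STREAM = "application/octet-stream"

def vote(results, trust):
    """results: {detector_name: mime_type}; trust: {detector_name: weight}.
    Returns the trust-weighted winner among informative votes, or the
    fallback type if every detector failed."""
    scores = {}
    for detector, mime in results.items():
        if mime == OCTET_STREAM:  # uninformative vote: skip it
            continue
        scores[mime] = scores.get(mime, 0.0) + trust[detector]
    if not scores:
        return OCTET_STREAM
    return max(scores, key=scores.get)

# The scenario from the comment: only the extension test detects a type.
results = {"magic": OCTET_STREAM, "extension": "application/x-grib",
           "meta": OCTET_STREAM}
trust = {"magic": 0.75, "extension": 0.65, "meta": 0.7}
print(vote(results, trust))  # application/x-grib
```

With this rule the lone informative vote wins even when the magic test's trust is set high, instead of octet-stream being returned as the final decision.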
[jira] [Updated] (TIKA-1517) MIME type selection with probability
[ https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke sh updated TIKA-1517: -- Priority: Trivial (was: Major) MIME type selection with probability Key: TIKA-1517 URL: https://issues.apache.org/jira/browse/TIKA-1517 Project: Tika Issue Type: Improvement Components: mime Affects Versions: 0.1-incubating, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.10, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6 Reporter: Luke sh Priority: Trivial Attachments: BaysianTest.java Improvement and intuition: The original implementation of MIME type selection/detection is somewhat inflexible by design, because it relies heavily on the outcome of magic-bytes MIME type identification; e.g. if magic bytes are applicable to a file, Tika follows the type detected by magic bytes. It would be better to give users more control over the method of choice. The proposed approach incorporates Bayes' theorem: users assign a weight, expressed as a probability, to each identification method, giving them control over which of the three methods currently implemented in Tika (magic bytes, file extension, and the metadata content-type hint) they trust most. The magic-bytes method is usually the most trustworthy, but in some situations file type identification must be strict: some users may want all of the identification methods to agree on the same type before they start processing a file, because incorrect identification is intolerable.
The current implementation is less flexible for this purpose and relies heavily on the magic-bytes identification method (although magic bytes is the most reliable of the three). Proposed design: incorporate probabilities as weights on each MIME type identification method currently implemented in Tika (magic bytes, file extension match, and the metadata content-type hint). For example, as a user I would like to express my preference for each method as a degree of trust, and rank the results when they do not coincide. Bayes' rule fits this intuition, and the following quantities are needed to implement it. Prior probability P(file_type), e.g. P(pdf): in theory this is computed from samples and depends on the domain or use case. Intuitively we care more about the ordering of the weighted results than about the actual numbers, and the prior depends on the samples for a particular use case or domain; e.g. if we happen to crawl a website that mostly serves PDF files, we can collect samples and compute the prior, and if 90% of the documents are PDF we define P(pdf) = 0.9. Here, however, we propose to make the prior a configurable parameter, and by default we leave it unapplied.
Alternatively, we could define the prior for each file type as 1/[number of supported file types in Tika], roughly 1/1157, which seems fairer; but this prior is identical for every type, and since we ultimately care about the ordering of the results, a constant factor cannot change the order. Bringing 1/1157 into the Bayesian equation would not affect the order but would burden the implementation with extra computation, so we leave the prior unapplied, i.e. we treat it as 1. Note again that we care about the order rather than the actual number; the parameter remains configurable, which we believe provides flexibility in some use cases. Conditional probability of a positive test given a file type, P(test | file_type), e.g. P(test1 = pdf | pdf): this probability is also based on samples and the domain or use case, and we leave it configurable. Based on our intuition, test1 (the magic-bytes method) is the most trustworthy, so the default is P(test1 = a_file_type | a_file_type) = 0.75; that is, given a file of a certain type, the probability that test1 predicts that type is 0.75. Next we propose 0.7 for test3 and 0.65 for test2 (again, test1 = magic bytes, test2 = file extension, test3 = metadata content-type hint).
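The weighting scheme above can be sketched numerically. This is an illustrative reconstruction, not the code in BaysianTest.java: it assumes the tests are conditionally independent given the type (the naive-Bayes assumption discussed in the comments), leaves the prior unapplied (treated as 1) as proposed, and invents a small false-positive rate for a test that disagrees:

```python
# Sketch: combine the three detection methods with Bayes-style weights.
# TRUST[t] = P(test t predicts type X | file is type X), the defaults
# proposed above; FALSE_POSITIVE is an assumed P(test says X | not X).
TRUST = {"magic": 0.75, "meta": 0.70, "extension": 0.65}
FALSE_POSITIVE = 0.05

def score(candidate, results):
    """Unnormalized posterior for `candidate` given each test's output,
    with the prior left unapplied (treated as 1)."""
    s = 1.0
    for test, predicted in results.items():
        s *= TRUST[test] if predicted == candidate else FALSE_POSITIVE
    return s

# Magic and extension agree on PDF; the meta hint disagrees.
results = {"magic": "application/pdf", "extension": "application/pdf",
           "meta": "text/html"}
candidates = {results[t] for t in results}
best = max(candidates, key=lambda c: score(c, results))
print(best)  # application/pdf
```

Because only the ordering of the scores matters, the constant prior drops out exactly as argued above; users would tune the trust values to express their preference among the methods.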
[jira] [Updated] (TIKA-1582) Mime Detection based on neural networks with Byte-frequency-histogram
[ https://issues.apache.org/jira/browse/TIKA-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke sh updated TIKA-1582: -- Attachment: nnmodel.docx Documentation Mime Detection based on neural networks with Byte-frequency-histogram -- Key: TIKA-1582 URL: https://issues.apache.org/jira/browse/TIKA-1582 Project: Tika Issue Type: Improvement Components: detector, mime Affects Versions: 1.7 Reporter: Luke sh Priority: Trivial Attachments: nnmodel.docx The proposed content-based MIME detection to be integrated into Tika is based on a machine learning algorithm: a neural network trained with back-propagation.
The input is 256 bins (0-255), one per byte value, each storing that byte's count of occurrences; the byte-frequency histogram is normalized to fall in the range 0 to 1 and then passed through a companding function to enhance the infrequent bytes. The output of the neural network is a binary decision, 1 or 0. Note that the proposed feature is implemented with the GRB file type as one example: we build a model that separates the GRB type from non-GRB types. The set of non-GRB files is huge and cannot easily be defined, so as many negative training examples as possible are needed to form the non-GRB decision boundary. The neural network involves two stages: training and classification. Training can be done in any programming language; in this feature/research the training is implemented in R, and the source can be found in my GitHub repository, https://github.com/LukeLiush/filetypeDetection; I am also going to post a document describing the use of the program and the syntax/format of its input and output. After training, we export the model and import it into Tika: we create a TrainedModelDetector that reads one or more model files with their parameters, so it can detect the MIME types covered by those models. Details of the research and usage of this proposed feature will be posted on my GitHub shortly. It is worth noting again that this research produced only one model, GRB, as an example to demonstrate content-based MIME detection.
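The feature extraction described above can be sketched as follows. This is an illustration, not the trained model's code: the exact normalization and companding function are not specified in the issue, so peak-normalization and a square-root compander are assumed as one common way of boosting infrequent bytes:

```python
# Sketch: byte-frequency histogram features for content-based detection.
def byte_histogram(data: bytes):
    """256 bins of byte counts, normalized to [0, 1] by the peak count,
    then companded (sqrt) so rare-but-present bytes are not drowned out
    by the few dominant byte values."""
    counts = [0] * 256
    for b in data:
        counts[b] += 1
    peak = max(counts) or 1  # avoid division by zero on empty input
    return [(c / peak) ** 0.5 for c in counts]

# A toy input: the GRIB magic word followed by a run of zero bytes.
feats = byte_histogram(b"GRIB" + bytes(16))
assert len(feats) == 256 and max(feats) == 1.0
```

The resulting 256-element vector is what the network consumes; without the companding step, a long run of a single byte value (here the zeros) would push every other bin toward 0.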
One challenge, again, is that the non-GRB file types cannot be clearly defined unless we feed the model example data for every existing file type in the world, which seems unrealistic; it is better to fix and define the set of classes/types in advance to narrow the problem domain. Another challenge is the size of the training data: even when we know the types we want to classify, getting enough training data to form a model is one of the main factors of success. For our example model, GRB data were collected from ftp://hydro1.sci.gsfc.nasa.gov/data/; the GRB data from that source all exhibit a similar pattern, so a simple neural network structure predicts well, and even linear logistic regression does a good job. However, when we pass GRB files collected from other sources to the model, it predicts poorly and unexpectedly. This raises the question of whether we need to include all training data or only the data of interest; including all data is very expensive, so it is necessary to introduce some domain knowledge to constrain the problem.
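Once a model is exported, the classification stage reduces to a forward pass over the histogram features. The following is a minimal sketch of a one-hidden-layer sigmoid network of the kind described; the layer sizes and weights are placeholders, not the trained GRB model that TrainedModelDetector would load:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(features, w_hidden, b_hidden, w_out, b_out):
    """One hidden layer of sigmoid units feeding a single sigmoid output;
    returns a score in (0, 1) that the input belongs to the trained type."""
    hidden = [sigmoid(sum(w * f for w, f in zip(ws, features)) + b)
              for ws, b in zip(w_hidden, b_hidden)]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)

# Toy model: 3 input features, 2 hidden units, one output.
score = forward([0.2, 0.9, 0.1],
                [[1.0, -0.5, 0.3], [-0.7, 0.8, 0.1]], [0.0, 0.1],
                [1.5, -1.2], -0.2)
is_match = score > 0.5  # binary decision: GRB vs non-GRB
```

In the real detector the feature vector would have 256 elements and the weights would come from the R training run; the 0.5 threshold turns the sigmoid score into the binary 1-or-0 decision the issue describes.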
[jira] [Updated] (TIKA-1582) Mime Detection based on neural networks with Byte-frequency-histogram
[ https://issues.apache.org/jira/browse/TIKA-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke sh updated TIKA-1582: -- Attachment: week2-report-histogram comparison.docx histogram comparison
[jira] [Updated] (TIKA-1582) Mime Detection based on neural networks with Byte-frequency-histogram
[ https://issues.apache.org/jira/browse/TIKA-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke sh updated TIKA-1582: -- Attachment: week6 report.docx Test report
[jira] [Created] (TIKA-1582) Mime Detection based on neural networks with Byte-frequency-histogram
Luke sh created TIKA-1582: - Summary: Mime Detection based on neural networks with Byte-frequency-histogram Key: TIKA-1582 URL: https://issues.apache.org/jira/browse/TIKA-1582 Project: Tika Issue Type: Improvement Components: detector, mime Affects Versions: 1.7 Reporter: Luke sh Priority: Trivial
[jira] [Commented] (TIKA-1582) Mime Detection based on neural networks with Byte-frequency-histogram
[ https://issues.apache.org/jira/browse/TIKA-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14382590#comment-14382590 ] Luke sh commented on TIKA-1582: --- A pull request with this feature for Tika will be created shortly. Mime Detection based on neural networks with Byte-frequency-histogram -- Key: TIKA-1582 URL: https://issues.apache.org/jira/browse/TIKA-1582 Project: Tika Issue Type: Improvement Components: detector, mime Affects Versions: 1.7 Reporter: Luke sh Priority: Trivial
[jira] [Updated] (TIKA-1582) Mime Detection based on neural networks with Byte-frequency-histogram
[ https://issues.apache.org/jira/browse/TIKA-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke sh updated TIKA-1582: -- Description: Content-based MIME type detection is one of the popular approaches to detecting MIME types; others are based on file extensions and magic numbers. Tika currently implements three detection approaches: 1) file extensions, 2) magic numbers (the most trustworthy in Tika), and 3) the Content-Type header of the HTTP response, if present and available. Content-based detection, by contrast, analyses the distribution of the entire byte stream, finds the pattern shared by files of the same type, and builds a function that groups files into one or more classes for classification and prediction. This feature may broaden Tika's usage and add a measure of security to MIME detection: because the model is etched with the patterns it has seen, types it has not been trained on need not be trusted. In some situations the magic numbers embedded in a file can be copied while the actual content is a potentially detrimental Trojan program; by placing trust in byte-frequency patterns instead, we can harden detection. The proposed content-based MIME detection to be integrated into Tika is based on a machine-learning algorithm, namely a neural network trained with back-propagation. The input is 256 bins (one per byte value 0-255), each storing the count of occurrences of that byte; the byte-frequency histograms are normalized to fall in the range 0 to 1, then passed through a companding function to boost the infrequent bytes. The output of the neural network is a binary decision, 1 or 0. Note that the proposed feature will be implemented with the GRB file type as one example.
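The preprocessing described above (256 byte-count bins, normalization to the 0-1 range, then companding to boost infrequent bytes) can be sketched as follows. This is a minimal illustration, not the actual feature-extraction code; in particular the square-root companding exponent is an assumed choice, not necessarily the function used in the R training scripts:

```python
def byte_frequency_features(data: bytes, companding_exponent: float = 0.5) -> list[float]:
    """Turn a byte stream into a 256-bin companded frequency histogram.

    Normalizing by the most frequent byte maps counts into [0, 1];
    raising them to an exponent < 1 (sqrt here, an illustrative choice)
    boosts the infrequent bytes relative to the frequent ones.
    """
    counts = [0] * 256
    for b in data:
        counts[b] += 1
    peak = max(counts) or 1  # avoid division by zero on empty input
    return [(c / peak) ** companding_exponent for c in counts]
```

The resulting 256-element vector is what would be fed to the classifier's input layer.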
In this example, we build a model that distinguishes the GRB file type from non-GRB file types. Note that the set of non-GRB files is huge and cannot be easily defined, so there need to be as many negative training examples as possible to form the non-GRB decision boundary. The neural network involves two stages: training and classification. Training can be done in any programming language; in this feature/research it is implemented in R, and the source can be found in my GitHub repository, https://github.com/LukeLiush/filetypeDetection; I am also going to post a document that describes the use of the program and the syntax/format of its input and output. After training, we export the model and import it into Tika; in Tika, we create a TrainedModelDetector that reads one or more model files with their parameters, so it can detect the MIME types covered by those models. Details of the research and of the usage of this proposed feature will be posted on my GitHub shortly. It is worth noting again that in this research we worked out only one model, GRB, to demonstrate the use of content-based MIME detection. One of the challenges, again, is that the non-GRB file types cannot be clearly defined unless we feed the model example data for every file type in existence, which is too utopian to be feasible; it is better that the set of classes/types is given and defined in advance to minimize the problem domain. Another challenge is the size of the training data: even when we know the types we want to classify, getting enough training data to form a model is one of the main factors of success.
In our example model, GRB data were collected from ftp://hydro1.sci.gsfc.nasa.gov/data/, and we found that the GRB data from that source all exhibit a similar pattern: a simple neural-network structure predicts well, and even a linear logistic regression does a good job. However, if we pass GRB files collected from other sources to the model, it predicts poorly and unexpectedly. This raises the question of whether we need to include all training data or only the data of interest; including all data is very expensive, so it is necessary to introduce some domain knowledge to minimize the problem domain. We believe users should know which types they want to classify and should be able to get enough training data, although collecting it can be a tedious and expensive process. Again, it is better to have domain knowledge of the set of types present in the users' database and to train a model with examples of every type in that database.
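The observation that even a linear logistic regression separates the source's GRB files well can be illustrated with a single logistic unit over the histogram features. The weights and bias below are hypothetical placeholders; in the actual feature they would come from the model exported by the R training code, and a neural network would stack hidden layers of such units:

```python
import math

def predict_grb(features: list[float], weights: list[float], bias: float) -> int:
    """Binary decision (1 = GRB, 0 = non-GRB) from one logistic unit.

    features: the companded byte-frequency histogram.
    weights/bias: learned parameters (hypothetical values in the tests).
    """
    z = bias + sum(w * x for w, x in zip(weights, features))
    probability = 1.0 / (1.0 + math.exp(-z))  # sigmoid activation
    return 1 if probability >= 0.5 else 0
```

A model trained this way is only as general as its training set, which is exactly the cross-source generalization problem described above.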
[jira] [Updated] (TIKA-1561) GCMD Directory Interchange Format (.dif) identification
[ https://issues.apache.org/jira/browse/TIKA-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke sh updated TIKA-1561: -- Description: Cited from http://gcmd.nasa.gov/add/difguide/WRITEADIF.pdf: "The Directory Interchange Format (DIF) is a metadata format used to create directory entries that describe scientific data sets. A DIF holds a collection of fields, which detail specific information about the data." A .dif file is a well-formed XML document describing a scientific data set; the XSD schema it references can be found inside the .dif XML file, e.g. http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/dif_v9.8.4.xsd. The reason for opening this ticket is that a Tika parser for DIF files is under consideration/development, so support for identifying the file type is needed. Although a .dif file is an XML file that the XML parser can handle, some of its fields may need specific processing before being extracted and injected into the system for analysis. It is therefore proposed that the following type, 'text/dif+xml', be appended to tika-mimetypes.xml as a subtype of application/xml, so that special processing can be applied to this particular kind of XML file:

  <mime-type type="text/dif+xml">
    <root-XML localName="DIF"/>
    <root-XML localName="DIF" namespaceURI="http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/"/>
    <glob pattern="*.dif"/>
    <sub-class-of type="application/xml"/>
  </mime-type>

Expected MIME type: text/dif+xml. The DIF format guide: http://gcmd.nasa.gov/add/difguide/ Example .dif files: 1) https://www.aoncadis.org/dataset/id/005f3222-7548-11e2-851e-00c0f03d5b7c.dif 2) https://www.aoncadis.org/dataset/id/0091cf0c-7ad3-11e2-851e-00c0f03d5b7c.dif 3) https://www.aoncadis.org/dataset/id/02a6301c-3ab3-11e4-8ee7-00c0f03d5b7c.dif An example .dif file has also been attached.
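The effect of the two root-XML rules in that registration can be illustrated with a short sketch. This is a simplified stand-in for what Tika's XML root-element detection does, not Tika's actual detector code: match the root element's local name (optionally qualified by the GCMD namespace) and otherwise fall back to the application/xml supertype:

```python
import xml.etree.ElementTree as ET

DIF_NS = "http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/"

def detect_dif(xml_text: str) -> str:
    """Return text/dif+xml when the root element is DIF, else the
    application/xml supertype (mirroring the sub-class-of rule)."""
    root = ET.fromstring(xml_text)
    # ElementTree encodes a namespaced tag as "{namespace}localName".
    if root.tag in ("DIF", "{%s}DIF" % DIF_NS):
        return "text/dif+xml"
    return "application/xml"
```

The glob rule (`*.dif`) would additionally let the type be guessed from the file name alone, without reading the content.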
GCMD Directory Interchange Format (.dif) identification --- Key: TIKA-1561 URL: https://issues.apache.org/jira/browse/TIKA-1561 Project: Tika Issue Type: Improvement Components: mime Affects Versions: 1.7 Reporter: Luke sh Priority: Trivial Attachments: carbon_isotopic_values_of_alkanes_extracted_from_paleosols.dif
[jira] [Updated] (TIKA-1561) GCMD Directory Interchange Format (.dif) identification
[ https://issues.apache.org/jira/browse/TIKA-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke sh updated TIKA-1561: -- Attachment: (was: carbon_isotopic_values_of_alkanes_extracted_from_paleosols.dif) GCMD Directory Interchange Format (.dif) identification --- Key: TIKA-1561 URL: https://issues.apache.org/jira/browse/TIKA-1561 Project: Tika Issue Type: Improvement Components: mime Affects Versions: 1.7 Reporter: Luke sh Priority: Trivial
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Issue Comment Deleted] (TIKA-1561) GCMD Directory Interchange Format (.dif) identification
[ https://issues.apache.org/jira/browse/TIKA-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke sh updated TIKA-1561: -- Comment: was deleted (was: sample dif file) GCMD Directory Interchange Format (.dif) identification --- Key: TIKA-1561 URL: https://issues.apache.org/jira/browse/TIKA-1561 Project: Tika Issue Type: Improvement Components: mime Affects Versions: 1.7 Reporter: Luke sh Priority: Trivial
[jira] [Created] (TIKA-1561) GCMD Directory Interchange Format (.dif) identification
Luke sh created TIKA-1561: - Summary: GCMD Directory Interchange Format (.dif) identification Key: TIKA-1561 URL: https://issues.apache.org/jira/browse/TIKA-1561 Project: Tika Issue Type: Improvement Components: mime Affects Versions: 1.7 Reporter: Luke sh Priority: Trivial
[jira] [Updated] (TIKA-1561) GCMD Directory Interchange Format (.dif) identification
[ https://issues.apache.org/jira/browse/TIKA-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke sh updated TIKA-1561: -- Attachment: carbon_isotopic_values_of_alkanes_extracted_from_paleosols.dif (sample dif file) GCMD Directory Interchange Format (.dif) identification --- Key: TIKA-1561 URL: https://issues.apache.org/jira/browse/TIKA-1561 Project: Tika Issue Type: Improvement Components: mime Affects Versions: 1.7 Reporter: Luke sh Priority: Trivial Attachments: carbon_isotopic_values_of_alkanes_extracted_from_paleosols.dif
[jira] [Updated] (TIKA-1561) GCMD Directory Interchange Format (.dif) identification
[ https://issues.apache.org/jira/browse/TIKA-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke sh updated TIKA-1561: -- Description: Cited from http://gcmd.nasa.gov/add/difguide/WRITEADIF.pdf: The Directory Interchange Format (DIF) is a metadata format used to create directory entries that describe scientific data sets. A DIF holds a collection of fields, which detail specific information about the data. A .dif file is a well-formed XML document describing the scientific data set; the XSD schema files are referenced inside the .dif XML file, e.g. http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/dif_v9.8.4.xsd
The reason for opening this ticket is that a Tika parser for this DIF format is under development, so support for identifying the file is needed. Although a DIF file appears to be an XML file that the XML parser can handle properly, some of its fields may still need specific processing before being extracted and injected into the system for analysis. It was therefore decided to use the type 'text/dif+xml', which extends application/xml, so that special processing can be applied to this particular kind of XML file:
<mime-type type="text/dif+xml">
  <root-XML localName="DIF"/>
  <root-XML localName="DIF" namespaceURI="http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/"/>
  <glob pattern="*.dif"/>
  <sub-class-of type="application/xml"/>
</mime-type>
Expected MIME type: text/dif+xml
DIF format guide: http://gcmd.nasa.gov/add/difguide/
Example .dif files:
1) https://www.aoncadis.org/dataset/id/005f3222-7548-11e2-851e-00c0f03d5b7c.dif
2) https://www.aoncadis.org/dataset/id/0091cf0c-7ad3-11e2-851e-00c0f03d5b7c.dif
3) https://www.aoncadis.org/dataset/id/02a6301c-3ab3-11e4-8ee7-00c0f03d5b7c.dif
An example .dif file has also been attached.
was: the same description without the list of example files.
GCMD Directory Interchange Format (.dif) identification --- Key: TIKA-1561 URL: https://issues.apache.org/jira/browse/TIKA-1561 Project: Tika Issue Type: Improvement Components: mime Affects Versions: 1.7 Reporter: Luke sh Priority: Trivial Attachments: carbon_isotopic_values_of_alkanes_extracted_from_paleosols.dif
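For context on how a root-XML rule like the one above matches: the detector inspects the local name (and optionally the namespace URI) of the document's root element. Below is a minimal JDK-only sketch of that check, not Tika's actual implementation; the class and method names are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Illustrative probe: pull the root element's local name from an XML stream
// using the JDK StAX API, the way a root-XML rule keys on localName="DIF".
public class RootXmlProbe {
    static String rootLocalName(String xml) throws Exception {
        XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        while (r.hasNext()) {
            if (r.next() == XMLStreamConstants.START_ELEMENT) {
                return r.getLocalName(); // first start element is the root
            }
        }
        return null; // no element found: not well-formed XML content
    }

    public static void main(String[] args) throws Exception {
        String dif = "<DIF xmlns=\"http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/\">"
                + "<Entry_ID>x</Entry_ID></DIF>";
        // A root local name of DIF is what the rule above maps to text/dif+xml
        System.out.println(rootLocalName(dif));
    }
}
```

The second root-XML line in the rule additionally constrains the namespace URI; a fuller sketch would compare `r.getNamespaceURI()` against it as well.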
[jira] [Commented] (TIKA-1561) GCMD Directory Interchange Format (.dif) identification
[ https://issues.apache.org/jira/browse/TIKA-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337774#comment-14337774 ] Luke sh commented on TIKA-1561: --- I am going to send a pull request with this dif type identification... work in progress
GCMD Directory Interchange Format (.dif) identification --- Key: TIKA-1561 URL: https://issues.apache.org/jira/browse/TIKA-1561 Project: Tika Issue Type: Improvement Components: mime Affects Versions: 1.7 Reporter: Luke sh Priority: Trivial Attachments: carbon_isotopic_values_of_alkanes_extracted_from_paleosols.dif
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1561) GCMD Directory Interchange Format (.dif) identification
[ https://issues.apache.org/jira/browse/TIKA-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke sh updated TIKA-1561: -- Attachment: carbon_isotopic_values_of_alkanes_extracted_from_paleosols.dif (sample dif file)
GCMD Directory Interchange Format (.dif) identification --- Key: TIKA-1561 URL: https://issues.apache.org/jira/browse/TIKA-1561 Project: Tika Issue Type: Improvement Components: mime Affects Versions: 1.7 Reporter: Luke sh Priority: Trivial Attachments: carbon_isotopic_values_of_alkanes_extracted_from_paleosols.dif
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1561) GCMD Directory Interchange Format (.dif) identification
[ https://issues.apache.org/jira/browse/TIKA-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke sh updated TIKA-1561: -- Description: Cited from http://gcmd.nasa.gov/add/difguide/WRITEADIF.pdf: The Directory Interchange Format (DIF) is a metadata format used to create directory entries that describe scientific data sets. A DIF holds a collection of fields, which detail specific information about the data. A .dif file is a well-formed XML document describing the scientific data set; the XSD schema files are referenced inside the .dif XML file, e.g. http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/dif_v9.8.4.xsd
The reason for opening this ticket is that a Tika parser for this DIF format is under development, so support for identifying this type of XML file is needed. Although a DIF file appears to be a proper XML file that the XML parser can handle, some of its fields may still need specific processing before being extracted and injected into the Solr system for analysis. It is therefore proposed that the type 'text/dif+xml', which extends application/xml, be added to tika-mimetypes.xml so that special processing can be applied to this particular XML file:
<mime-type type="text/dif+xml">
  <root-XML localName="DIF"/>
  <root-XML localName="DIF" namespaceURI="http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/"/>
  <glob pattern="*.dif"/>
  <sub-class-of type="application/xml"/>
</mime-type>
Expected MIME type: text/dif+xml
DIF format guide: http://gcmd.nasa.gov/add/difguide/
Example .dif files:
1) https://www.aoncadis.org/dataset/id/005f3222-7548-11e2-851e-00c0f03d5b7c.dif
2) https://www.aoncadis.org/dataset/id/0091cf0c-7ad3-11e2-851e-00c0f03d5b7c.dif
3) https://www.aoncadis.org/dataset/id/02a6301c-3ab3-11e4-8ee7-00c0f03d5b7c.dif
An example .dif file has also been attached.
was: the previous revision of the description, without the Solr and tika-mimetypes.xml wording.
GCMD Directory Interchange Format (.dif) identification --- Key: TIKA-1561 URL: https://issues.apache.org/jira/browse/TIKA-1561 Project: Tika Issue Type: Improvement Components: mime Affects Versions: 1.7 Reporter: Luke sh Priority: Trivial Attachments: carbon_isotopic_values_of_alkanes_extracted_from_paleosols.dif
[jira] [Updated] (TIKA-1561) GCMD Directory Interchange Format (.dif) identification
[ https://issues.apache.org/jira/browse/TIKA-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke sh updated TIKA-1561: -- Description: Cited from http://gcmd.nasa.gov/add/difguide/WRITEADIF.pdf: The Directory Interchange Format (DIF) is a metadata format used to create directory entries that describe scientific data sets. A DIF holds a collection of fields, which detail specific information about the data. A .dif file is a well-formed XML document describing the scientific data set; the XSD schema files are referenced inside the .dif XML file, e.g. http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/dif_v9.8.4.xsd
The reason for opening this ticket is that a Tika parser for this DIF format is under development, so support for identifying this type of XML file is needed. Although a DIF file appears to be a proper XML file that the XML parser can handle, some of its fields may still need specific processing before being extracted and injected into the Solr system for analysis. It is therefore proposed that the type 'text/dif+xml', which extends application/xml, be added to tika-mimetypes.xml to support detection of this specific XML type, so that special processing can be applied to this particular XML file:
<mime-type type="text/dif+xml">
  <root-XML localName="DIF"/>
  <root-XML localName="DIF" namespaceURI="http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/"/>
  <glob pattern="*.dif"/>
  <sub-class-of type="application/xml"/>
</mime-type>
Expected MIME type: text/dif+xml
DIF format guide: http://gcmd.nasa.gov/add/difguide/
Example .dif files:
1) https://www.aoncadis.org/dataset/id/005f3222-7548-11e2-851e-00c0f03d5b7c.dif
2) https://www.aoncadis.org/dataset/id/0091cf0c-7ad3-11e2-851e-00c0f03d5b7c.dif
3) https://www.aoncadis.org/dataset/id/02a6301c-3ab3-11e4-8ee7-00c0f03d5b7c.dif
An example .dif file has also been attached.
was: the previous revision of the description, with slightly different wording in the tika-mimetypes.xml sentence.
GCMD Directory Interchange Format (.dif) identification --- Key: TIKA-1561 URL: https://issues.apache.org/jira/browse/TIKA-1561 Project: Tika Issue Type: Improvement Components: mime Affects Versions: 1.7 Reporter: Luke sh Priority: Trivial Attachments: carbon_isotopic_values_of_alkanes_extracted_from_paleosols.dif
[jira] [Commented] (TIKA-1539) GRB file magic bytes and extension matching
[ https://issues.apache.org/jira/browse/TIKA-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308212#comment-14308212 ] Luke sh commented on TIKA-1539: --- pull request #28, adds grb files for unit tests.
GRB file magic bytes and extension matching Key: TIKA-1539 URL: https://issues.apache.org/jira/browse/TIKA-1539 Project: Tika Issue Type: Improvement Components: mime Affects Versions: 1.7 Reporter: Luke sh
GRB type detection with magic bytes and extensions probably needs to be supported in Tika. The GRB parser is under development, so it would be good to have magic-byte and extension matching detection for it. However, GRB does not have a standard MIME type, so the following extension and magic matching settings in tika-mimetypes.xml are proposed for GRB MIME type identification:
<mime-type type="application/x-grib">
  <acronym>GRIB</acronym>
  <_comment>General Regularly-distributed Information in Binary form</_comment>
  <tika:link>http://en.wikipedia.org/wiki/GRIB</tika:link>
  <magic priority="50">
    <match value="GRIB" type="string" offset="0"/>
  </magic>
  <glob pattern="*.grb"/>
  <glob pattern="*.grb1"/>
  <glob pattern="*.grb2"/>
</mime-type>
Any suggestions and advice will be welcomed and appreciated.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TIKA-1539) GRB file magic bytes and extension matching
Luke sh created TIKA-1539: - Summary: GRB file magic bytes and extension matching Key: TIKA-1539 URL: https://issues.apache.org/jira/browse/TIKA-1539 Project: Tika Issue Type: Improvement Components: mime Affects Versions: 1.7 Reporter: Luke sh
GRB type detection with magic bytes and extensions probably needs to be supported in Tika. The GRB parser is under development, so it would be good to have magic-byte and extension matching detection for it. However, GRB does not have a standard MIME type, so the following extension and magic matching settings in tika-mimetypes.xml are proposed for GRB MIME type identification:
<mime-type type="application/x-grib">
  <acronym>GRIB</acronym>
  <_comment>General Regularly-distributed Information in Binary form</_comment>
  <tika:link>http://en.wikipedia.org/wiki/GRIB</tika:link>
  <magic priority="50">
    <match value="GRIB" type="string" offset="0"/>
  </magic>
  <glob pattern="*.grb"/>
  <glob pattern="*.grb1"/>
  <glob pattern="*.grb2"/>
</mime-type>
Any suggestions and advice will be welcomed and appreciated.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
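The magic rule proposed above reduces to a fixed-offset byte comparison: the ASCII bytes "GRIB" at offset 0. A minimal sketch of that check in plain Java (class and method names are ours, not Tika's):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative magic-byte check matching the rule
//   <match value="GRIB" type="string" offset="0"/>
public class GribMagic {
    private static final byte[] MAGIC = "GRIB".getBytes(StandardCharsets.US_ASCII);

    // header = first few bytes of the stream; true if it starts with "GRIB"
    static boolean looksLikeGrib(byte[] header) {
        return header.length >= MAGIC.length
                && Arrays.equals(Arrays.copyOf(header, MAGIC.length), MAGIC);
    }

    public static void main(String[] args) {
        System.out.println(looksLikeGrib("GRIB....rest".getBytes(StandardCharsets.US_ASCII)));
        System.out.println(looksLikeGrib("notgrib".getBytes(StandardCharsets.US_ASCII)));
    }
}
```

In Tika itself, registering the `<magic>` rule in tika-mimetypes.xml is enough; no custom code like this is required.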
[jira] [Updated] (TIKA-1535) Inheritance modification for the class MIMETypes
[ https://issues.apache.org/jira/browse/TIKA-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke sh updated TIKA-1535: -- Description: The class MimeTypes does not currently allow for inheritance. There are a couple of methods in this class which look independent, and some of them need to be exposed or made overridable for special needs or use cases; this would give Tika users more flexibility for new MIME detection algorithms. Perhaps it would be a good idea to extract the detector logic out of the MimeTypes class and create an independent detector for Tika.
was: the same description without the last sentence about extracting the detector logic.
Inheritance modification for the class MIMETypes Key: TIKA-1535 URL: https://issues.apache.org/jira/browse/TIKA-1535 Project: Tika Issue Type: Improvement Components: mime Reporter: Luke sh Priority: Trivial
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
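A minimal illustration of the change requested here, with made-up names (this is not Tika's real MimeTypes code): removing `final` from the class and widening a `private` hook to `protected` is what lets a subclass, such as the probabilistic detector proposed in TIKA-1517, override the selection step.

```java
// Before the change this would have been `public final class` with a
// `private` hook, so the subclass below could not exist.
public class MimeTypesSketch {
    // protected (was private): subclasses may override the selection step
    protected String applyHook(String magicType) {
        return magicType;
    }

    public String detect(String magicType) {
        return applyHook(magicType);
    }

    public static void main(String[] args) {
        System.out.println(new ProbabilisticMimeTypes().detect(null));
    }
}

// Hypothetical subclass in the spirit of TIKA-1517: re-rank or repair the
// candidate chosen by the magic-byte step.
class ProbabilisticMimeTypes extends MimeTypesSketch {
    @Override
    protected String applyHook(String magicType) {
        return magicType == null ? "application/octet-stream" : magicType;
    }
}
```

The alternative floated in the ticket, extracting the detector logic into its own class, would achieve the same extensibility through composition instead of inheritance.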
[jira] [Commented] (TIKA-1517) MIME type selection with probability
[ https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14295928#comment-14295928 ] Luke sh commented on TIKA-1517: --- the probability selection will inherit the class MimeTypes, which needs to be modified by exposing some of its methods so they can be inherited and overridden, TIKA-1535.
MIME type selection with probability Key: TIKA-1517 URL: https://issues.apache.org/jira/browse/TIKA-1517 Project: Tika Issue Type: Improvement Components: mime Affects Versions: 0.1-incubating, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.10, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6 Reporter: Luke sh Attachments: BaysianTest.java
Improvement and intuition: The original implementation of MIME type selection/detection is somewhat inflexible by design, as it relies heavily on the outcome of magic-byte MIME type identification; e.g. if magic bytes are applicable to a file, Tika follows the file type detected by them. It would be better to give users more control over the method of choice. The proposed approach incorporates Bayes' theorem: users assign weights, expressed as probabilities, to each detection method, giving them control over their preference among the MIME type identification methods implemented in Tika. Currently there are 3 such methods (magic bytes, file extension, and the metadata content-type hint). By weighting the methods, users can choose the method they trust most (the magic-byte method is usually the most trustworthy). The virtue is that in some situations file type identification must be handled sensitively: some users may want all of the identification methods to agree on the same file type before they start processing files, because incorrect file type identification is intolerable. The current implementation seems too inflexible for this purpose and relies heavily on magic-byte identification (although magic bytes are the most reliable of the three).
Proposed design: The idea is to attach a probability, used as a weight, to each MIME type identification method currently implemented in Tika (magic-byte match, file extension match, and metadata content-type hint). For example, as a user I would like to assign my preference to each method according to my degree of trust in it, and rank the results when they do not coincide. Bayes' rule fits this intuition. The following quantities are needed for a Bayesian implementation:
Prior probability P(file_type), e.g. P(pdf). In theory this is computed from samples and depends on the domain or use case. Intuitively we care more about the ordering of the weights or probabilities of the results than about the actual numbers, and the prior depends on the samples for a particular use case or domain. E.g. if we crawl a website that contains mostly PDF files, we can collect samples and compute the prior; if 90% of the documents are PDFs, the prior is P(pdf) = 0.9. Here, however, we propose making the prior a user-configurable parameter, left unapplied by default.
Alternatively, we could define the prior for each file type as 1/[number of file types supported by Tika], approximately 1/1157, which seems fairer. The reason to avoid this is that such a prior is the same for every type, and since we ultimately care about the ordering of the results, a fixed factor cannot change that ordering; bringing 1/1157 into the Bayesian equation would not affect the order and would only burden the implementation with extra computation. We therefore leave the prior unapplied, i.e. set to 1, as if it were not there. Note again that we care more about the order than the actual numbers; this parameter is configurable, which we believe provides flexibility in some use cases.
Conditional probability of a positive test given a file type, P(test | file_type), e.g. P(test1 = pdf | pdf). This probability is likewise based on sample collections and the domain or use case, so we leave it configurable; but based on our intuition, test1 (the magic-byte method) is the most trustworthy, so the default value of P(test1 = a_file_type | a_file_type) is 0.75. That is to say, given a file whose type really is a_file_type, the probability of test1 predicting a_file_type is
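The scheme above can be sketched in a few lines of plain Java. This is our illustration, not the attached BaysianTest.java: the 0.75 trust for the magic-byte method is the default named above, while the trust values for the extension and metadata methods are assumed for illustration, and the prior is left unapplied (1.0) as proposed.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the Bayesian weighting: the (unnormalized) posterior of a
// candidate type is prior * product over methods of the method's trust when
// it agrees with the candidate, or (1 - trust) when it disagrees.
public class BayesSketch {
    // trust[i] = P(method i predicts T | file really is T)
    // index 0 = magic bytes (0.75 per the ticket); extension and metadata
    // trusts are assumed values for illustration only.
    static final double[] TRUST = {0.75, 0.70, 0.60};

    static double posterior(String candidate, String[] predictions, double prior) {
        double p = prior; // prior left "unapplied" means passing 1.0 here
        for (int i = 0; i < predictions.length; i++) {
            p *= candidate.equals(predictions[i]) ? TRUST[i] : (1.0 - TRUST[i]);
        }
        return p; // unnormalized; only the ordering of candidates matters
    }

    static String best(String[] predictions) {
        Map<String, Double> scores = new LinkedHashMap<>();
        for (String c : predictions) {
            scores.putIfAbsent(c, posterior(c, predictions, 1.0));
        }
        return scores.entrySet().stream()
                .max(Map.Entry.comparingByValue()).get().getKey();
    }

    public static void main(String[] args) {
        // magic says pdf, extension says pdf, metadata hint says html
        System.out.println(best(new String[]{
                "application/pdf", "application/pdf", "text/html"}));
    }
}
```

With these numbers the pdf candidate scores 0.75 * 0.70 * 0.40 = 0.21 against 0.25 * 0.30 * 0.60 = 0.045 for html, so the two agreeing methods outvote the metadata hint.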
[jira] [Commented] (TIKA-1535) Inheritance modification for the class MIMETypes
[ https://issues.apache.org/jira/browse/TIKA-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14295922#comment-14295922 ] Luke sh commented on TIKA-1535: --- TIKA-1517, the MIME type selection mechanism with probability will be implemented by inheriting this class MimeTypes; MimeTypes is currently declared final, and some of its methods carry the private modifier, which does not allow overriding.
Inheritance modification for the class MIMETypes Key: TIKA-1535 URL: https://issues.apache.org/jira/browse/TIKA-1535 Project: Tika Issue Type: Improvement Components: mime Reporter: Luke sh Priority: Trivial
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TIKA-1517) MIME type selection with probability
[ https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14295928#comment-14295928 ] Luke sh edited comment on TIKA-1517 at 1/28/15 11:06 PM: - the probability selection implementation will inherit the class MimeTypes, which needs to be modified by exposing some of its methods so they can be inherited and overridden, TIKA-1535.
was (Author: lukeliush): the probability selection will inherit the class MimeTypes, which needs to be modified by exposing some of its methods so they can be inherited and overridden, TIKA-1535.
MIME type selection with probability Key: TIKA-1517 URL: https://issues.apache.org/jira/browse/TIKA-1517 Project: Tika Issue Type: Improvement Components: mime Affects Versions: 0.1-incubating, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.10, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6 Reporter: Luke sh Attachments: BaysianTest.java
[jira] [Commented] (TIKA-1521) Handle password protected 7zip files
[ https://issues.apache.org/jira/browse/TIKA-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14293944#comment-14293944 ] Luke sh commented on TIKA-1521: --- Hi @Nick Burch, I just came across this problem and had a quick look at it. I also ran a quick test with the artifacts attached to the ticket, and I think I was able to replicate the problem (i.e. https://builds.apache.org/job/tika-trunk-jdk1.6/425/org.apache.tika$tika-parsers/testReport/junit/org.apache.tika.parser.pkg/Seven7ParserTest/testPasswordProtected/ ), but I don't see that this has anything to do with the JDK version or platform; the following is what I got. I think this might have something to do with org.apache.commons.compress.archivers.sevenz.SevenZFile, or with the way commons-compress is invoked. My platform is Windows 8.1, and the test fails with both JDK 1.6 and JDK 1.8:
java.lang.AssertionError: text.txt not found in:
  at org.junit.Assert.fail(Assert.java:88)
  at org.junit.Assert.assertTrue(Assert.java:41)
  at org.apache.tika.TikaTest.assertContains(TikaTest.java:85)
  at org.apache.tika.parser.pkg.Seven7ParserTest.testPasswordProtected(Seven7ParserTest.java:195)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at java.lang.reflect.Method.invoke(Method.java:597)
  at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
  at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
  at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
  at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
  at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
  at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
  at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
  at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
  at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
  at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
  at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
  at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
  at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
  at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
  at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
  at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
  at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
  at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
  at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
  at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
Handle password protected 7zip files Key: TIKA-1521 URL: https://issues.apache.org/jira/browse/TIKA-1521 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.7 Reporter: Nick Burch Fix For: 1.8
While working on TIKA-1028, I noticed that while Commons Compress doesn't currently handle decrypting password-protected zip files, it does handle password-protected 7zip files. We should therefore add logic to the package parser to spot password-protected 7zip files and fetch the password for them from a PasswordProvider if given.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1517) MIME type detection with probability
[ https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14281254#comment-14281254 ] Luke sh commented on TIKA-1517: --- Basic feature design: the probability selection mechanism is incorporated only in the MimeTypes detector in Tika. There are currently 4 detectors implemented in Tika: org.gagravarr.tika.OggDetector, org.apache.tika.parser.microsoft.POIFSContainerDetector, org.apache.tika.parser.pkg.ZipContainerDetector and lastly org.apache.tika.mime.MimeTypes. Other than the MimeTypes detector, the other 3 detectors call out to other open APIs to detect MIME types; if there is an inconsistency in the MIME types detected by different detectors, Tika will generally choose the type detected earlier in the preferential order above. Inside the MimeTypes detector, the probability selection mechanism is invoked from the detect() method; the method called is applyProbabilities(List<MimeType> possibleTypes, MimeType extMimeType, MimeType metadataMimeType). possibleTypes is the list of MIME types estimated by the magic-bytes method. It is possible for the magic-bytes method to estimate more than one type; it is also assumed that the list maintains an order of precedence, with the first element carrying the most weight. The applyProbabilities method calculates the posterior probability for each file type estimated by each type detection method, keeps track of the one with the highest posterior probability, and eventually returns it as the result. It is also worth noting that one type might be a super type of another, in which case they belong to the same class of types; so at the beginning of the method there is an extra procedure where types from the same class are reset to the most specific one. E.g. if type A is a super type of B (note that A and B belong to the same class of types, but A and B could be different types estimated by different methods), then A is reset to B, or vice versa.
Then for each type we compute the posterior probability conditioned on the types detected by each method. By default, the probability selection mechanism is disabled; to enable it, set the boolean useProbSelection to true (the default value is false). To be continued. MIME type detection with probability Key: TIKA-1517 URL: https://issues.apache.org/jira/browse/TIKA-1517 Project: Tika Issue Type: Improvement Components: mime Affects Versions: 0.1-incubating, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.10, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6 Reporter: Luke sh Attachments: BaysianTest.java Improvement and intuition: the original implementation of MIME type selection/detection is somewhat inflexible by design, as it relies heavily on the outcome of magic-bytes MIME type identification; thus, if magic bytes are applicable to a file, Tika follows the file type detected by magic bytes. It may be better to provide more control over the method of choice. The proposed approach incorporates Bayesian probability: users can assign weights, in terms of probabilities, to each of the MIME type identification methods implemented in Tika, and so control which method is preferred. Currently there are 3 methods for identifying MIME type in Tika (i.e. magic bytes, file extension, and the Metadata content-type hint). By introducing weights on the methods, users can choose which method they trust most (the magic-bytes method is usually the most trustworthy). The virtue is that in some situations file type identification must be handled sensitively: some users might want all of the MIME type identification methods to agree on the same file type before they start processing those files, because incorrect file type identification is less tolerable there.
The current implementation seems less flexible for this purpose and relies heavily on the magic-bytes file identification method (although magic bytes are the most reliable of the three). Proposed design: the idea is to incorporate probabilities as weights on each MIME type identification method currently implemented in Tika (the magic-bytes approach, the file extension match, and the Metadata content-type hint). For example, as a user, I would probably like to assign a preference to each method based on my degree of trust in it, and to order the results when they don't coincide. Bayes' rule is appropriate here to capture that intuition. The following is what is needed for a Bayesian implementation: the prior probability P(file_type), e.g. P(pdf); theoretically this is computed from samples, and it depends on the domain or use case,
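The supertype-resolution step described earlier in this comment (when two detection methods return types from the same hierarchy, reset both to the most specific one) could be sketched roughly as follows. This is a minimal illustration only, not Tika's actual MediaTypeRegistry API; the hierarchy map and the class/method names are hypothetical stand-ins.

```java
import java.util.HashMap;
import java.util.Map;

// Rough sketch of the supertype-resolution step: if type A is a
// super type of type B, the pair is "reset" to B, the more specific
// type. The hierarchy map below is a hypothetical stand-in for
// Tika's media type registry, with a couple of illustrative entries.
public class TypeSpecializer {
    private static final Map<String, String> SUPER = new HashMap<>();
    static {
        // child -> immediate super type (illustrative entries only)
        SUPER.put("application/vnd.oasis.opendocument.text", "application/zip");
        SUPER.put("application/zip", "application/octet-stream");
    }

    /** True if {@code ancestor} is a (transitive) super type of {@code type}. */
    static boolean isSuperTypeOf(String ancestor, String type) {
        for (String t = SUPER.get(type); t != null; t = SUPER.get(t)) {
            if (t.equals(ancestor)) {
                return true;
            }
        }
        return false;
    }

    /** If one type is a super type of the other, return the more specific one. */
    static String mostSpecific(String a, String b) {
        if (a.equals(b)) {
            return a;
        }
        if (isSuperTypeOf(a, b)) {
            return b; // a is the more general type, so keep b
        }
        if (isSuperTypeOf(b, a)) {
            return a; // b is the more general type, so keep a
        }
        return null;  // unrelated types: no specialization possible
    }
}
```

With the entries above, mostSpecific("application/zip", "application/vnd.oasis.opendocument.text") returns the ODF type, since zip is its super type; two unrelated types yield null and are left for the posterior comparison to decide.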
[jira] [Updated] (TIKA-1517) MIME type detection with probability
[ https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke sh updated TIKA-1517: -- Description: Improvement and intuition: the original implementation of MIME type selection/detection is somewhat inflexible by design, as it relies heavily on the outcome of magic-bytes MIME type identification; thus, if magic bytes are applicable to a file, Tika follows the file type detected by magic bytes. It may be better to provide more control over the method of choice. The proposed approach incorporates Bayesian probability: users can assign weights, in terms of probabilities, to each of the MIME type identification methods implemented in Tika, and so control which method is preferred. Currently there are 3 methods for identifying MIME type in Tika (i.e. magic bytes, file extension, and the Metadata content-type hint). By introducing weights on the methods, users can choose which method they trust most (the magic-bytes method is usually the most trustworthy). The virtue is that in some situations file type identification must be handled sensitively: some users might want all of the MIME type identification methods to agree on the same file type before they start processing those files, because incorrect file type identification is less tolerable there. The current implementation seems less flexible for this purpose and relies heavily on the magic-bytes file identification method (although magic bytes are the most reliable of the three). Proposed design: the idea is to incorporate probabilities as weights on each MIME type identification method currently implemented in Tika (the magic-bytes approach, the file extension match, and the Metadata content-type hint).
For example, as a user, I would probably like to assign a preference to each method based on my degree of trust in it, and to order the results when they don't coincide. Bayes' rule is appropriate here to capture that intuition. The following is what is needed for a Bayesian implementation. The prior probability P(file_type), e.g. P(pdf): theoretically this is computed from samples, and it depends on the domain or use case. Intuitively we care more about the ordering of the weights, or of the probabilities of the results, than about the actual numbers, and the prior depends on samples from a particular use case or domain. E.g. if we happen to crawl a website that contains mostly PDF files, we can collect some samples and compute the prior; if 90% of the sampled docs are PDF, the prior is defined to be P(pdf) = 0.9. Here, however, we propose to make the prior a configurable parameter, and by default to leave it unapplied. Alternatively, we could define the prior for each file type to be 1/[number of supported file types in Tika]; the number would be approximately 1/1157, and using it seems fairer. The point of avoiding it, though, is that this prior is fixed for every type: since we ultimately care about the ordering of the results, a number that is the same for every type cannot affect that ordering, and bringing 1/1157 into the Bayesian equation would only burden the implementation with extra computation. Thus we leave the prior unapplied, which means we assign 1 to it, as if it didn't exist. Note again that we care more about the ordering than the actual numbers; this parameter is configurable, which we believe provides flexibility in some use cases. The conditional probability of a positive test given a file type, P(test | file_type), e.g.
P(test1 = pdf | pdf): this probability is also based on a collection of samples and on the domain or use case, and we leave it configurable. Based on our intuition, test1 (the magic-bytes method) is the most trustworthy, so the default value is 0.75 for P(test1 = a_file_type | a_file_type); this says that, given a file of some type, the probability of test1 predicting that type is 0.75. Since we trust test1 the most, we next propose 0.7 for test3 and 0.65 for test2. (Note again: test1 = magic bytes, test2 = file extension, test3 = Metadata content-type hint.) The conditional probabilities of negative tests also need to be defined intuitively. E.g. by default, given a file whose type is not pdf, the probability of test1 predicting pdf is 1 - P(test1 = pdf | pdf), so P(test1 = pdf | ~pdf) = 1 - 0.75 = 0.25, as we trust test1 the most; the other tests are given 0.35 and 0.3 respectively with the same intuition. The goal is to find P(file_type | test1 = file_type, test2 = file_type, test3 = file_type). (Please note, we are mostly interested in
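Under the defaults proposed above (positive rates 0.75/0.65/0.70 for test1/test2/test3, negative rates 0.25/0.35/0.30, and the prior left unapplied, i.e. treated as 1), the posterior comparison reduces, with a naive conditional-independence assumption between the three tests, to comparing products of the per-test rates. The following is a minimal sketch of that arithmetic, not Tika's actual applyProbabilities implementation; the class and method names are hypothetical.

```java
// Hedged sketch of the Bayesian combination described in the ticket.
// Assumes the three tests (magic bytes, file extension, content-type
// hint) are conditionally independent given the true type, and
// normalizes over the two hypotheses "the file is the candidate type"
// vs. "it is not", with the prior left unapplied (treated as 1).
public class BayesSketch {
    // Default positive rates P(test_i = T | T) from the ticket.
    static final double[] POS = {0.75, 0.65, 0.70}; // test1, test2, test3
    // Default negative rates P(test_i = T | not T) from the ticket.
    static final double[] NEG = {0.25, 0.35, 0.30};

    /**
     * Posterior P(type | test results), where agrees[i] is true iff
     * test i reported the candidate type. Only the ordering of
     * candidate types matters for selection, as the ticket notes.
     */
    static double posterior(boolean[] agrees) {
        double isType = 1.0;   // likelihood under "file is the type"
        double notType = 1.0;  // likelihood under "file is not the type"
        for (int i = 0; i < agrees.length; i++) {
            isType  *= agrees[i] ? POS[i] : 1 - POS[i];
            notType *= agrees[i] ? NEG[i] : 1 - NEG[i];
        }
        return isType / (isType + notType); // normalize over both hypotheses
    }
}
```

For example, when all three tests agree, the score is (0.75 * 0.65 * 0.70) / (0.75 * 0.65 * 0.70 + 0.25 * 0.35 * 0.30) = 0.34125 / 0.3675 = 13/14 ≈ 0.93, whereas a type reported by none of the tests scores well below 0.5, which matches the intuition that agreement across methods should dominate the ranking.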