[jira] [Commented] (TIKA-1533) PDF parse failing to capture right order of text (2 columns)
[ https://issues.apache.org/jira/browse/TIKA-1533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14295159#comment-14295159 ]

Tim Allison commented on TIKA-1533:
---

In the first document, does printed page 303 (PDF page 152) contain Tabell 5.7 - Tabell 5.9? I only see "362" on printed page 362 and in "sammanlagt 362 frågor" on printed page 88 (PDF page 45).

Have you run PDFBox's own ExtractText app directly to see whether it has the same issue as Tika?

PDF parse failing to capture right order of text (2 columns)
Key: TIKA-1533
URL: https://issues.apache.org/jira/browse/TIKA-1533
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.6, 1.7
Environment: Java 8, Mac OS X
Reporter: Tamara

When I convert a document with two columns, the order of the columns is inverted in the text file. I only noticed because it is an index list. The problem starts on page 303; to find it in the converted text, search for 362. The second file has the same problem on page 341.

I have tried setSortByPosition(true), and the columns got scrambled. I have tried copying and pasting from the PDF preview, and the copied text is in the correct order. I have also tried PDFXStream, and it parses the text correctly.

Here are the files in which I have seen the issue:
http://www.sbu.se/upload/Publikationer/Content0/1/Autismspektrumtillst%C3%A5nd_fulltext.pdf
http://www.sbu.se/upload/publikationer/content0/1/forstamningssyndrom_fulltext.pdf

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
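The report that sorting by position scrambles the columns can be illustrated with a small, library-free sketch. It is not PDFBox's or Tika's actual algorithm; the Chunk type, the coordinates and the gutter position are all hypothetical. It only shows why a pure top-to-bottom, left-to-right sort interleaves rows from two columns, and what a column-aware ordering would produce instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class ColumnOrderSketch {
    /** A text fragment and the top-left corner of its box (hypothetical type; y grows downward). */
    static final class Chunk {
        final String text;
        final double x;
        final double y;

        Chunk(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Pure position sort: top to bottom, then left to right. Rows of both columns interleave. */
    static List<String> sortByPosition(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> c.y).thenComparingDouble(c -> c.x));
        return texts(sorted);
    }

    /** Column-aware order: everything left of the gutter first, then the right column. */
    static List<String> sortByColumn(List<Chunk> chunks, double gutterX) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparing((Chunk c) -> c.x >= gutterX).thenComparingDouble(c -> c.y));
        return texts(sorted);
    }

    private static List<String> texts(List<Chunk> chunks) {
        List<String> out = new ArrayList<>();
        for (Chunk c : chunks) {
            out.add(c.text);
        }
        return out;
    }

    /** Two index entries per column, deliberately stored out of reading order. */
    static List<Chunk> samplePage() {
        return Arrays.asList(
                new Chunk("C 3", 300, 100), new Chunk("A 1", 50, 100),
                new Chunk("D 4", 300, 120), new Chunk("B 2", 50, 120));
    }

    public static void main(String[] args) {
        System.out.println(sortByPosition(samplePage()));    // rows interleaved across columns
        System.out.println(sortByColumn(samplePage(), 200)); // left column, then right column
    }
}
```

A real extractor would have to discover the gutter itself rather than receive it as a parameter, which is the hard part of multi-column layouts.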
[jira] [Created] (TIKA-1533) PDF parse failing to capture right order of text (2 columns)
Tamara created TIKA-1533:

Summary: PDF parse failing to capture right order of text (2 columns)
Key: TIKA-1533
URL: https://issues.apache.org/jira/browse/TIKA-1533
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.7, 1.6
Environment: Java 8, Mac OS X
Reporter: Tamara
[jira] [Commented] (TIKA-1517) MIME type selection with probability
[ https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14295928#comment-14295928 ]

Luke sh commented on TIKA-1517:
---

The probability selection will inherit from the class MimeTypes, which first needs to be modified to expose some of its methods so that they can be inherited and overridden; see TIKA-1535.

MIME type selection with probability
Key: TIKA-1517
URL: https://issues.apache.org/jira/browse/TIKA-1517
Project: Tika
Issue Type: Improvement
Components: mime
Affects Versions: 0.1-incubating, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.10, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6
Reporter: Luke sh
Attachments: BaysianTest.java

Improvement and intuition

The original implementation of MIME type selection/detection is not very flexible by design, because it relies heavily on the outcome of magic-bytes identification: if magic bytes are applicable to a file, Tika follows the file type they indicate. It may be better to give users more control over which method is used.

The proposed approach borrows from Bayes' theorem. Users assign a weight, expressed as a probability, to each of the MIME type identification methods available in Tika, which gives them control over which method is preferred. There are currently 3 such methods: magic bytes, file extension, and the metadata content-type hint. With weights, users can choose the method they trust most, although the magic-bytes method is usually trustworthy. The benefit shows in situations where file type identification is sensitive: some users may want all of the identification methods to agree on the same type before they start processing a file, because incorrect identification is less tolerable.
The current implementation is not flexible enough for this purpose and relies heavily on the magic-bytes identification method (although magic bytes are the most reliable of the three).

Proposed design

The idea is to attach a probability, used as a weight, to each MIME type identification method currently implemented in Tika (magic bytes, file extension match, and metadata content-type hint). For example, as a user I would like to assign a preference to each method according to how much I trust it, and to rank the results when the methods do not agree. Bayes' rule seems appropriate for this intuition. The following quantities are needed to apply it.

Prior probability P(file_type), e.g. P(pdf). In theory this is computed from samples and depends on the domain or use case. Intuitively, we care more about the ordering of the resulting probabilities than about their actual values. For example, if we crawl a website that contains mostly PDF files, we can collect some samples and compute the prior from them; if 90% of the sampled documents are PDF, the prior is P(pdf) = 0.9. Here we propose to make the prior a user-configurable parameter and, by default, to treat it as not applicable.
Alternatively, we could define the prior for each file type as 1/[number of supported file types in Tika], which I think is approximately 1/1157. Using this number seems fairer, but the reason to avoid it is that the prior would then be the same for every type. Since we ultimately care about the ordering of the results, a constant prior cannot change that ordering; bringing 1/1157 into the Bayesian equation would only burden the implementation with extra computation. We therefore treat the prior as not applicable by default, which means assigning it the value 1, as if it did not exist. Note again that we care more about the ordering than the actual numbers, and that this parameter is configurable, which we believe provides flexibility in some use cases.

Conditional probability of a positive test given a file type, P(test | file_type), e.g. P(test1 = pdf | pdf). This probability also depends on collected samples and on the domain or use case, so we leave it configurable. Based on our intuition that test1 (the magic-bytes method) is the most trustworthy, the default value of P(test1 = a_file_type | a_file_type) is 0.75; that is, given a file whose type is a_file_type, the probability that test1 predicts a_file_type is
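The scoring idea described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not the code attached to the ticket: the class and method names are hypothetical, the trust values for the extension and metadata tests are invented (only 0.75 for magic bytes comes from the description), and a test that votes for a different type is assumed to contribute 1 - trust. The score is left unnormalized because, as the description says, only the ordering matters.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

class BayesianSelectionSketch {
    /**
     * Scores each candidate type as prior * product over tests of P(vote | type).
     * votes[i] is the type reported by test i; trust[i] stands for P(test_i = T | T).
     * A test that voted for another type contributes (1 - trust[i]), a simplification
     * made for this sketch. A missing prior defaults to 1, i.e. "not applicable".
     */
    static Map<String, Double> score(String[] votes, double[] trust, Map<String, Double> priors) {
        Set<String> candidates = new LinkedHashSet<>(Arrays.asList(votes));
        Map<String, Double> scores = new LinkedHashMap<>();
        for (String type : candidates) {
            double s = priors.getOrDefault(type, 1.0);
            for (int i = 0; i < votes.length; i++) {
                s *= votes[i].equals(type) ? trust[i] : 1.0 - trust[i];
            }
            scores.put(type, s);
        }
        return scores;
    }

    /** Returns the candidate with the highest score. */
    static String best(Map<String, Double> scores) {
        String winner = null;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            if (winner == null || e.getValue() > scores.get(winner)) {
                winner = e.getKey();
            }
        }
        return winner;
    }

    public static void main(String[] args) {
        // Votes in order: magic bytes, file extension, metadata content-type hint.
        String[] votes = {"application/pdf", "text/plain", "text/plain"};
        Map<String, Double> noPrior = Collections.emptyMap();
        // Magic bytes outweigh two moderately trusted tests that disagree with it.
        System.out.println(best(score(votes, new double[] {0.75, 0.6, 0.6}, noPrior)));
        // Two strongly trusted tests that agree with each other outweigh magic bytes.
        System.out.println(best(score(votes, new double[] {0.75, 0.8, 0.8}, noPrior)));
    }
}
```

Passing the same prior for every candidate (for instance 1/1157) multiplies all scores by the same factor and leaves the winner unchanged, which is the argument the description gives for defaulting the prior to 1.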
[jira] [Commented] (TIKA-1535) Inheritance modification for the class MIMETypes
[ https://issues.apache.org/jira/browse/TIKA-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14295922#comment-14295922 ]

Luke sh commented on TIKA-1535:
---

The MIME type selection mechanism with probability (TIKA-1517) will be implemented by inheriting from the class MimeTypes. MimeTypes is currently declared final, and some of its methods are private, which does not allow overriding.

Inheritance modification for the class MIMETypes
Key: TIKA-1535
URL: https://issues.apache.org/jira/browse/TIKA-1535
Project: Tika
Issue Type: Improvement
Components: mime
Reporter: Luke sh
Priority: Trivial

The class MimeTypes does not currently allow inheritance. Several methods in this class look independent, and some of them need to be exposed or overridden for special needs or use cases. This would give Tika users more flexibility to implement new MIME detection algorithms.
[jira] [Commented] (TIKA-1521) Handle password protected 7zip files
[ https://issues.apache.org/jira/browse/TIKA-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14295791#comment-14295791 ]

Hudson commented on TIKA-1521:
--

SUCCESS: Integrated in tika-trunk-jdk1.6 #442 (See [https://builds.apache.org/job/tika-trunk-jdk1.6/442/])
TIKA-1521: follow commons-compress and require installation of jce before testing password on 7z file (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=revrev=1655431)
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pkg/PackageParser.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/pkg/Seven7ParserTest.java

Handle password protected 7zip files
Key: TIKA-1521
URL: https://issues.apache.org/jira/browse/TIKA-1521
Project: Tika
Issue Type: Improvement
Components: parser
Affects Versions: 1.7
Reporter: Nick Burch
Fix For: 1.8

While working on TIKA-1028, I noticed that while Commons Compress doesn't currently handle decrypting password-protected zip files, it does handle password-protected 7zip files.

We should therefore add logic to the package parser to spot password-protected 7zip files and fetch the password for them from a PasswordProvider, if one is given.
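The two ingredients mentioned here, fetching a password from a provider and checking for JCE before testing, can be sketched without any Tika or Commons Compress classes. The PasswordSource interface below is a hypothetical stand-in, not Tika's PasswordProvider. The UTF-16LE encoding reflects how Commons Compress's 7z reader is understood to expect passwords and should be verified against its Javadoc. The 256-bit check rests on 7z encryption using AES-256.

```java
import java.nio.charset.StandardCharsets;
import java.security.NoSuchAlgorithmException;
import javax.crypto.Cipher;

class SevenZPasswordSketch {
    /** Hypothetical stand-in for a password callback; not Tika's actual interface. */
    interface PasswordSource {
        String passwordFor(String resourceName); // may return null when no password is known
    }

    /** Encodes the supplied password as UTF-16LE bytes, or returns null when none is available. */
    static byte[] encodePassword(PasswordSource source, String resourceName) {
        String password = (source == null) ? null : source.passwordFor(resourceName);
        return (password == null) ? null : password.getBytes(StandardCharsets.UTF_16LE);
    }

    /** True when the JVM's crypto policy permits 256-bit AES keys. */
    static boolean strongCryptoAvailable() {
        try {
            return Cipher.getMaxAllowedKeyLength("AES") >= 256;
        } catch (NoSuchAlgorithmException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] encoded = encodePassword(name -> "secret", "archive.7z");
        System.out.println(encoded.length + " password bytes, strong crypto: " + strongCryptoAvailable());
    }
}
```

A test that exercises an encrypted archive could skip itself when strongCryptoAvailable() returns false, which matches the spirit of the commit message above.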
[jira] [Commented] (TIKA-1534) Upgrade to Commons Compress 1.9
[ https://issues.apache.org/jira/browse/TIKA-1534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14295789#comment-14295789 ]

Hudson commented on TIKA-1534:
--

SUCCESS: Integrated in tika-trunk-jdk1.6 #442 (See [https://builds.apache.org/job/tika-trunk-jdk1.6/442/])
TIKA-1534: Upgrade to Commons Compress 1.9 (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=revrev=1655433)
* /tika/trunk/CHANGES.txt
* /tika/trunk/tika-parsers/pom.xml

Upgrade to Commons Compress 1.9
---
Key: TIKA-1534
URL: https://issues.apache.org/jira/browse/TIKA-1534
Project: Tika
Issue Type: Improvement
Reporter: Tim Allison
Priority: Minor
[jira] [Commented] (TIKA-1517) MIME type selection with probability
[ https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14296103#comment-14296103 ]

Tyler Palsulich commented on TIKA-1517:
---

Hi [~Lukeliush]. Thanks for raising this idea! Have you tested this probability selection mechanism on a set of files? Did the Mime Type detection improve?

Instead of extending MimeTypes, you might be able to implement this as a standalone Detector which has an instance of MimeTypes it uses to get the Mime Type hints needed for your Bayesian system. Does that make sense?

MIME type selection with probability
Key: TIKA-1517
URL: https://issues.apache.org/jira/browse/TIKA-1517
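Tyler's suggestion of a standalone Detector that holds a MimeTypes instance, rather than extending it, can be sketched as plain composition. Every type below is a hypothetical stand-in defined locally: TypeDetector is not Tika's Detector interface (which works on streams and Metadata), and TypeRegistry only mimics a final class that cannot be subclassed. The selection policy is a simple weighted vote chosen for brevity, not the Bayesian formula from the ticket.

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

class StandaloneDetectorSketch {
    /** Hypothetical detector contract, simplified to plain values. */
    interface TypeDetector {
        String detect(byte[] prefix, String fileName, String declaredType);
    }

    /** Stand-in for a final registry class: it can be used, but not extended. */
    static final class TypeRegistry {
        String byMagic(byte[] prefix) {
            String head = new String(prefix, 0, Math.min(prefix.length, 4), StandardCharsets.US_ASCII);
            return head.equals("%PDF") ? "application/pdf" : "application/octet-stream";
        }

        String byExtension(String fileName) {
            if (fileName.endsWith(".pdf")) return "application/pdf";
            if (fileName.endsWith(".txt")) return "text/plain";
            return "application/octet-stream";
        }
    }

    /** Holds a registry instead of inheriting from it, and applies its own selection policy. */
    static final class WeightedDetector implements TypeDetector {
        private final TypeRegistry registry;
        private final double magicWeight;
        private final double extensionWeight;
        private final double hintWeight;

        WeightedDetector(TypeRegistry registry, double magicWeight, double extensionWeight, double hintWeight) {
            this.registry = registry;
            this.magicWeight = magicWeight;
            this.extensionWeight = extensionWeight;
            this.hintWeight = hintWeight;
        }

        @Override
        public String detect(byte[] prefix, String fileName, String declaredType) {
            // Collect one hint per method and add up the weights behind each type.
            Map<String, Double> votes = new HashMap<>();
            votes.merge(registry.byMagic(prefix), magicWeight, Double::sum);
            votes.merge(registry.byExtension(fileName), extensionWeight, Double::sum);
            if (declaredType != null) {
                votes.merge(declaredType, hintWeight, Double::sum);
            }
            return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
        }
    }

    public static void main(String[] args) {
        byte[] pdfHead = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        TypeDetector detector = new WeightedDetector(new TypeRegistry(), 0.75, 0.3, 0.3);
        System.out.println(detector.detect(pdfHead, "notes.txt", "text/plain"));
    }
}
```

Because the detector only calls the registry's existing methods, the registry can stay final and keep its private helpers private.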
[jira] [Comment Edited] (TIKA-1517) MIME type selection with probability
[ https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14296103#comment-14296103 ]

Tyler Palsulich edited comment on TIKA-1517 at 1/29/15 12:04 AM:
-

Hi [~Lukeliush]. Thanks for raising this idea! Have you tested this probability selection mechanism on a set of files? Did the Mime Type detection improve?

Instead of extending MimeTypes, you might be able to implement this as a standalone Detector. Your Detector could use an instance of MimeTypes to get the Mime Type hints needed for your Bayesian system. Does that make sense?

was (Author: tpalsulich):
Hi [~Lukeliush]. Thanks for raising this idea! Have you tested this probability selection mechanism on a set of files? Did the Mime Type detection improve? Instead of extending MimeTypes, you might be able to implement this as a standalone Detector which has an instance of MimeTypes it uses to get the Mime Type hints needed for your Bayesian system. Does that make sense?

MIME type selection with probability
Key: TIKA-1517
URL: https://issues.apache.org/jira/browse/TIKA-1517
[jira] [Commented] (TIKA-1423) Build a parser to extract data from GRIB formats
[ https://issues.apache.org/jira/browse/TIKA-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14296129#comment-14296129 ]

Lewis John McGibbney commented on TIKA-1423:

I am working on this and think I have navigated the OSGi + bundle problem we were having. I've just released the edu.ucar thredds-parent, cdm, netCDF4 and grib 4.5.4 dependencies to Maven Central, so I will try to complete this ticket again once the dependencies have propagated to the mirrors.

Build a parser to extract data from GRIB formats
Key: TIKA-1423
URL: https://issues.apache.org/jira/browse/TIKA-1423
Project: Tika
Issue Type: New Feature
Components: metadata, mime, parser
Affects Versions: 1.6
Reporter: Vineet Ghatge
Assignee: Vineet Ghatge
Priority: Critical
Labels: features, newbie
Fix For: 1.8
Attachments: GRIBParsertest.java, GribParser.java, NLDAS_FORA0125_H.A20130112.1200.002.grb, TIKA-1423.palsulich.120614.patch, TIKA-1423.patch, fileName.html, gdas1.forecmwf.2014062612.grib2

The Arctic dataset contains a format called GRIB (General Regularly-distributed Information in Binary form, http://en.wikipedia.org/wiki/GRIB). GRIB is a well-known, concise data format used in meteorology to store historical and forecast weather data. There are 2 different types of the format, GRIB 0 and GRIB 2. The focus will be on GRIB 2, which is the most prevalent.

Each GRIB record intended for either transmission or storage contains a single parameter with values located at an array of grid points, or represented as a set of spectral coefficients, for a single level (or layer), encoded as a continuous bit stream. Logical divisions of the record are designated as sections, each of which provides control information and/or data.
A GRIB record consists of six sections, two of which are optional:
(0) Indicator Section
(1) Product Definition Section (PDS)
(2) Grid Description Section (GDS) - optional
(3) Bit Map Section (BMS) - optional
(4) Binary Data Section (BDS)
(5) '' (ASCII Characters)
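As a rough illustration of what reading the Indicator Section involves, the sketch below checks the first bytes of a message. It assumes, from the published GRIB layout rather than from this ticket, that a message begins with the ASCII characters GRIB and carries its edition number in the eighth octet. The class and method names are hypothetical, and this is not the parser attached to the issue, which relies on the edu.ucar libraries.

```java
import java.nio.charset.StandardCharsets;

class GribIndicatorSketch {
    /** Returns the GRIB edition number, or -1 if the bytes do not start a GRIB message. */
    static int gribEdition(byte[] header) {
        if (header == null || header.length < 8) {
            return -1;
        }
        String magic = new String(header, 0, 4, StandardCharsets.US_ASCII);
        if (!"GRIB".equals(magic)) {
            return -1;
        }
        // Assumption: octet 8 (index 7) of the Indicator Section holds the edition number.
        return header[7] & 0xFF;
    }

    public static void main(String[] args) {
        byte[] sample = {'G', 'R', 'I', 'B', 0, 0, 0, 2};
        System.out.println("edition = " + gribEdition(sample));
    }
}
```

A check like this is the kind of test a MIME magic rule performs before a full parser is handed the stream.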
[jira] [Comment Edited] (TIKA-1517) MIME type selection with probability
[ https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14295928#comment-14295928 ]

Luke sh edited comment on TIKA-1517 at 1/28/15 11:06 PM:
-

The probability selection implementation will inherit from the class MimeTypes, which first needs to be modified to expose some of its methods so that they can be inherited and overridden; see TIKA-1535.

was (Author: lukeliush):
The probability selection will inherit from the class MimeTypes, which first needs to be modified to expose some of its methods so that they can be inherited and overridden; see TIKA-1535.

MIME type selection with probability
Key: TIKA-1517
URL: https://issues.apache.org/jira/browse/TIKA-1517
RE: [jira] [Commented] (TIKA-1535) Inheritance modification for the class MIMETypes
Hi Professor and all,

A Bayesian or machine learning Detector is different from the Bayesian selection mechanism reported in TIKA-1517. It would make sense to implement a machine learning algorithm in a separate Detector class. I have not gone far with this design, as I am still at the research stage, collecting data. Once I have enough data to form a model, and especially once I can prove the concept, I will be able to turn to the implementation and design of the machine learning detector. (BTW, I have some ideas for data collection and training, but it will still take some time to come up with even a quick-and-dirty proof of concept. I am still working on the data collection. There are also some design problems in the learning techniques, which I will come to once I have a clear picture of the data. I think I may have to crawl the data and label it for training, and some preprocessing steps need care too.)

However, my current implementation in TIKA-1517 is based solely on MIME type selection with probability (I cannot find a clearer name that is distinguishable from detection), which may have nothing to do with a genuine machine learning detector. It is a feature for adding weights to each Tika MIME type detection algorithm.

But I think you are right: in the future we will need it to assign weights to a pool of detection algorithms, including machine learning techniques or content-based detection algorithms. Declaring MimeTypes final has its design purpose, and I don't think it is a good idea to lump detector code into MimeTypes. I will come back to this design or architecture problem once I have a clear idea of the machine learning model (not necessarily a Bayesian model for detection).
BTW, off the top of my head, I would tend to distill the detector semantics out of MimeTypes, as mentioned below. What do you think about creating, say, a TikaDetector class independent of MimeTypes, and removing MimeTypes from the detectors (i.e. dropping "implements Detector" from MimeTypes)?

I will continue to think about this design problem as we move along, and I will leave notes on the ticket for sure. It looks like an important, big change, so any suggestions are welcome and appreciated.

Thanks
Luke

-Original Message-
From: Christian Alan Mattmann [mailto:mattm...@usc.edu]
Sent: Wednesday, January 28, 2015 6:30 PM
To: Luke; 'Mattmann, Chris A (3980)'
Cc: nsf-polar-usc-stude...@googlegroups.com
Subject: Re: [jira] [Commented] (TIKA-1535) Inheritance modification for the class MIMETypes

Hi Luke, thanks much. I think we should be having this discussion on the dev@tika.apache.org list too, but thanks also for CC’ing the Polar students list.

My feeling is that Tyler has a good point and that having a BayesianDetector makes a ton of sense. How about we try that as a start, and see where it goes?

Cheers,
Chris

Chris Mattmann, Ph.D.
Adjunct Associate Professor, Computer Science Department
University of Southern California
Los Angeles, CA 90089 USA
Email: mattm...@usc.edu
WWW: http://sunset.usc.edu/~mattmann/

-Original Message-
From: Luke hanson311...@gmail.com
Date: Wednesday, January 28, 2015 at 5:48 PM
To: Chris Mattmann chris.a.mattm...@jpl.nasa.gov
Cc: Chris Mattmann mattm...@usc.edu, NSF Polar CyberInfrastructure DR Students nsf-polar-usc-stude...@googlegroups.com
Subject: FW: [jira] [Commented] (TIKA-1535) Inheritance modification for the class MIMETypes

Hi Professor,

I was about to modify the code to work with inheritance and code reuse when Tyler posted the following suggestion, which is a bit enlightening.
Declaring the class final in this case seems to tell me that any input stream passed to the class is attached to one fixed MimeTypes (I tend to think MimeTypes should be tied to one input stream), or it can be interpreted as the MimeTypes of an input stream. If we inherit from it with my implementation, MimeTypesBaysianSelection, that will look odd as inheritance, because my Bayesian implementation is more like an operation attached to that input stream's MimeTypes. It seems the MimeTypes class is used not only as a MIME type detector (although it implements the Detector interface) but also for other purposes, e.g. users can take a peek at the input stream's MIME types, extensions, magics, etc. That is probably why it is called MimeTypes rather than something like Detector. I think it is not a detector, but some of its methods, such as getMagics or similar, make it easy to fit into the slot of Detectors, as it is easier to just outfit it with a Detector interface and
Re: [jira] [Commented] (TIKA-1535) Inheritance modification for the class MIMETypes
Hi Luke,

-Original Message-
From: Luke hanson311...@gmail.com
Date: Wednesday, January 28, 2015 at 7:15 PM
To: Chris Mattmann mattm...@usc.edu, Chris Mattmann chris.a.mattm...@jpl.nasa.gov, dev@tika.apache.org dev@tika.apache.org
Cc: NSF Polar CyberInfrastructure DR Students nsf-polar-usc-stude...@googlegroups.com
Subject: RE: [jira] [Commented] (TIKA-1535) Inheritance modification for the class MIMETypes

> Hi Professor and all,
>
> A Bayesian or machine learning Detector is different from the Bayesian selection mechanism reported in TIKA-1517. It would make sense to implement a machine learning algorithm in a separate Detector class. I have not gone too far with this design thought, as I am still at the research stage with data collection. Once I have enough data and am able to form a model, and especially once I am able to prove my concept, I will be able to get down to the machine learning detector implementation and its design considerations. (BTW, I think I have some ideas about data collection and training. It will still take some time to come up with something, even quick and dirty, that can prove the concept with machine learning. I am still working on the data collection, and there are also some design problems within the learning techniques; I will come to them once I have a clear idea of the data. I think I may have to crawl the data and label it for training, and there are certain preprocessing steps to take care of too.)

+1.

> However, my current implementation in TIKA-1517 is solely based on MIME type selection (I cannot find any clearer name distinguishable from detection) with probability, which might have nothing to do with a genuine machine learning detector; it is a feature for adding weights to each Tika MIME type detection algorithm.

Gotcha.

> But I think you are right, and in the future we will kind of need it to assign weights to a pool of detection algorithms, including machine learning techniques or content-based detection algorithms. The current implementation of MimeTypes as final has its design purpose, and I don’t think it is a good idea to lump detector code into MimeTypes, but I will come back to this design or architecture problem once I have some clear ideas about the machine learning model (not necessarily a Bayesian model for detection).
>
> BTW, off the top of my head, I would tend to distill the detector semantics out of MimeTypes, as mentioned below. What do you think about creating a, say, TikaDetector class independent of MimeTypes, and removing MimeTypes from the detectors (i.e. getting rid of the "implements Detector" in MimeTypes)?

Yes, can you explore doing this?

> I will continue to think about this design problem as we move along, and I will leave notes on the ticket for sure. It looks like an important or big change, so any suggestions will be welcomed and appreciated.

Thank you Luke, will do. I will read more and comment on it. Thanks for sharing this with the list!

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++
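The idea discussed in this thread, keeping the detection strategies separate from MimeTypes and assigning each a weight, can be illustrated with a small sketch. Everything here is hypothetical: `TypeGuesser`, `BY_MAGIC`, `BY_EXTENSION` and `select` are made-up names, not Tika classes, and the plain weighted vote below is a stand-in, not the Bayesian selection from TIKA-1517. It only shows that composition avoids having to subclass a final class.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch; none of these types are part of Tika. */
public class WeightedSelectionSketch {

    /** Stand-in for one detection strategy (magic bytes, file extension, ...). */
    public interface TypeGuesser {
        /** Returns a media type name, or null if this strategy has no opinion. */
        String guess(byte[] prefix, String fileName);
    }

    /** Assumed example strategy: recognises the "%PDF" magic bytes only. */
    public static final TypeGuesser BY_MAGIC = (prefix, fileName) -> {
        byte[] magic = "%PDF".getBytes(StandardCharsets.US_ASCII);
        if (prefix == null || prefix.length < magic.length) return null;
        for (int i = 0; i < magic.length; i++) {
            if (prefix[i] != magic[i]) return null;
        }
        return "application/pdf";
    };

    /** Assumed example strategy: recognises the ".txt" extension only. */
    public static final TypeGuesser BY_EXTENSION = (prefix, fileName) ->
            fileName != null && fileName.endsWith(".txt") ? "text/plain" : null;

    /** Sums the weight of every strategy voting for a type and returns the top-scoring type. */
    public static String select(Map<TypeGuesser, Double> weights, byte[] prefix, String fileName) {
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<TypeGuesser, Double> e : weights.entrySet()) {
            String type = e.getKey().guess(prefix, fileName);
            if (type != null) {
                scores.merge(type, e.getValue(), Double::sum);
            }
        }
        String best = "application/octet-stream"; // fallback when nobody votes
        double bestScore = 0.0;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            if (e.getValue() > bestScore) {
                best = e.getKey();
                bestScore = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<TypeGuesser, Double> weights = new LinkedHashMap<>();
        weights.put(BY_MAGIC, 0.7);     // weights are arbitrary illustration values
        weights.put(BY_EXTENSION, 0.3);
        byte[] prefix = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(select(weights, prefix, "report.txt")); // prints application/pdf
    }
}
```

Because the selector only holds references to its strategies, a machine learning strategy could later be added to the pool without touching the existing ones.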
[jira] [Comment Edited] (TIKA-1518) Docker with Tika Server
[ https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296439#comment-14296439 ]

Chris A. Mattmann edited comment on TIKA-1518 at 1/29/15 6:15 AM:
--

Thanks Tyler. Can you raise #2 on infrastruct...@apache.org? That would be an awesome idea, and then keep folks here posted. As for #1, +1 from me. RE: #3, there is a TIKA issue on that, I think it's https://issues.apache.org/jira/browse/TIKA-1302

was (Author: chrismattmann):
Thanks Tyler. Can you raise #2 on infrastruct...@apache.org? That would be an awesome idea, and then keep folks here posted. As for #1, +1 from me. RE: #3, there is a TIKA issue on that, I think it's TIKA-1312

Docker with Tika Server
---

Key: TIKA-1518
URL: https://issues.apache.org/jira/browse/TIKA-1518
Project: Tika
Issue Type: New Feature
Reporter: Paul Ramirez
Fix For: 1.8

This version should be able to demonstrate as many of Apache Tika's capabilities as possible. For instance with GDAL, Tesseract, and FFmpeg to show parsers which require installation of other dependencies. In addition, this should help move TIKA-1301 forward and should leverage the suggestion made by [~lewismc] of a script which can pull down the latest version of Apache Tika.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1518) Docker with Tika Server
[ https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296439#comment-14296439 ]

Chris A. Mattmann commented on TIKA-1518:
-

Thanks Tyler. Can you raise #2 on infrastruct...@apache.org? That would be an awesome idea, and then keep folks here posted. As for #1, +1 from me. RE: #3, there is a TIKA issue on that, I think it's TIKA-1312
[jira] [Comment Edited] (TIKA-1518) Docker with Tika Server
[ https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296439#comment-14296439 ]

Chris A. Mattmann edited comment on TIKA-1518 at 1/29/15 6:15 AM:
--

Thanks Tyler. Can you raise #2 on infrastruct...@apache.org? That would be an awesome idea, and then keep folks here posted. As for #1, +1 from me. RE: #3, there is a TIKA issue on that, I think it's TIKA-1302

was (Author: chrismattmann):
Thanks Tyler. Can you raise #2 on infrastruct...@apache.org? That would be an awesome idea, and then keep folks here posted. As for #1, +1 from me. RE: #3, there is a TIKA issue on that, I think it's https://issues.apache.org/jira/browse/TIKA-1302
[jira] [Comment Edited] (TIKA-1423) Build a parser to extract data from GRIB formats
[ https://issues.apache.org/jira/browse/TIKA-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296541#comment-14296541 ]

Lewis John McGibbney edited comment on TIKA-1423 at 1/29/15 7:54 AM:
-

Patch for trunk which passes all tests, including the issues experienced with the bundle module. Some investigative work was required here, as well as publishing [Unidata dependencies|http://search.maven.org/#search|ga|1|ucar] to Maven Central and updating our [wiki documentation|https://wiki.apache.org/tika/ThirdPartySonaType].
Please insert the .grib file into
{code}
tika-parsers/src/test/resources/test-documents/gdas1.forecmwf.2014062612.grib2
{code}

was (Author: lewismc):
Patch for trunk which passes all tests including issues experienced with bundle module. Some investigative work was required here as well as publishing [Unidata dependencies|http://search.maven.org/#search|ga|1|ucar] to Maven central and updating our [https://wiki.apache.org/tika/ThirdPartySonaType|wiki documentation].

Build a parser to extract data from GRIB formats

Key: TIKA-1423
URL: https://issues.apache.org/jira/browse/TIKA-1423
Project: Tika
Issue Type: New Feature
Components: metadata, mime, parser
Affects Versions: 1.6
Reporter: Vineet Ghatge
Assignee: Vineet Ghatge
Priority: Critical
Labels: features, newbie
Fix For: 1.8
Attachments: GRIBParsertest.java, GribParser.java, NLDAS_FORA0125_H.A20130112.1200.002.grb, TIKA-1423.palsulich.120614.patch, TIKA-1423.patch, TIKA-1423v2.patch, fileName.html, gdas1.forecmwf.2014062612.grib2

The Arctic dataset contains a MIME format called GRIB - General Regularly-distributed Information in Binary form http://en.wikipedia.org/wiki/GRIB . GRIB is a well-known, concise data format used in meteorology to store historical and weather data. There are 2 different types of the format GRIB 0, GRIB 2. The focus will be on GRIB 2 which is the most prevalent.

Each GRIB record intended for either transmission or storage contains a single parameter with values located at an array of grid points, or represented as a set of spectral coefficients, for a single level (or layer), encoded as a continuous bit stream. Logical divisions of the record are designated as sections, each of which provides control information and/or data. A GRIB record consists of six sections, two of which are optional:
(0) Indicator Section
(1) Product Definition Section (PDS)
(2) Grid Description Section (GDS) optional
(3) Bit Map Section (BMS) optional
(4) Binary Data Section (BDS)
(5) '' (ASCII Characters)

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
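As a rough illustration of the Indicator Section (section 0) listed above, the sketch below checks for the ASCII characters "GRIB" at the start of a record and reads the edition number from the eighth octet, which is where GRIB editions 1 and 2 store it. This is not the attached GribParser.java and does not use the Unidata library; `GribHeaderSketch` and `edition` are hypothetical names chosen for the example.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch; not the GribParser attached to TIKA-1423. */
public class GribHeaderSketch {

    /**
     * Returns the GRIB edition number from the Indicator Section,
     * or -1 if the bytes are too short or do not start with "GRIB".
     */
    public static int edition(byte[] header) {
        byte[] magic = "GRIB".getBytes(StandardCharsets.US_ASCII);
        if (header == null || header.length < 8) {
            return -1;
        }
        for (int i = 0; i < magic.length; i++) {
            if (header[i] != magic[i]) {
                return -1;
            }
        }
        // Octet 8 (index 7) carries the edition number in editions 1 and 2.
        return header[7] & 0xFF;
    }

    public static void main(String[] args) {
        byte[] grib2Start = {'G', 'R', 'I', 'B', 0, 0, 0, 2};
        System.out.println(edition(grib2Start)); // prints 2
    }
}
```

A real parser would go on to read the total message length and walk the remaining sections; this sketch stops at the first eight octets.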
Re: multiple detect call - different results (tika 1.7)
Dear Gabriele,

Thanks for your question. It should be sent to dev@tika.apache.org (moving dev-ow...@tika.apache.org to BCC). I’ll take a look tomorrow if someone else hasn’t answered yet.

Cheers,
Chris

-Original Message-
From: Gabriele Guidi gabriele.gu...@eng.it
Date: Wednesday, January 28, 2015 at 5:25 AM
To: dev-ow...@tika.apache.org dev-ow...@tika.apache.org
Subject: multiple detect call - different results (tika 1.7)

Hi,

I found a strange behavior. I have a p7m file; I extract the file inside the signed one and then use Tika to discover its MIME type. The first call gives me application/pdf (that's correct), BUT every subsequent call to Tika's detect method on the same InputStream gives me application/octet-stream. Why? I cannot understand the behavior or find a solution.

Just a snippet of code:

    InputStream inputsbust = content.getContentStream();
    System.out.println("1 mime " + filepath + " : " + tika.detect(inputsbust));
    System.out.println("2 mime " + filepath + " : " + tika.detect(inputsbust));
    System.out.println("3 mime " + filepath + " : " + tika.detect(inputsbust));

Result:

    1 mime /home/gguidi/01_file.pdf : application/pdf
    2 mime /home/gguidi/01_file.pdf : application/octet-stream
    3 mime /home/gguidi/01_file.pdf : application/octet-stream

Thanks

--
Gabriele Guidi
Direzione Pubblica Amministrazione
gabriele.gu...@eng.it
Engineering Ingegneria Informatica spa
Via Marconi, 10 - 40122, Bologna
Tel. +39-051.0435135
www.eng.it http://www.eng.it
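One possible explanation for the behavior reported above, offered as an assumption and not a confirmed diagnosis, is that the stream returned by content.getContentStream() does not support mark/reset, so each detection call consumes the leading bytes and later calls no longer see the PDF magic. The sketch below reproduces that effect with the JDK only. `sniff` is a hypothetical stand-in for a magic-byte detector, not Tika's implementation, and `withoutMark` exists only to simulate a stream that cannot rewind.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of why repeated detection on one stream can differ. */
public class DetectTwiceSketch {

    /** Stand-in detector: looks at the next four bytes and rewinds only if the stream allows it. */
    public static String sniff(InputStream in) {
        try {
            boolean canRewind = in.markSupported();
            if (canRewind) {
                in.mark(4);
            }
            byte[] head = new byte[4];
            int n = 0;
            while (n < head.length) {
                int r = in.read(head, n, head.length - n);
                if (r < 0) break;
                n += r;
            }
            if (canRewind) {
                in.reset();
            }
            boolean pdf = n == 4 && "%PDF".equals(new String(head, StandardCharsets.US_ASCII));
            return pdf ? "application/pdf" : "application/octet-stream";
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Wraps a stream so that it reports no mark/reset support (simulation helper). */
    public static InputStream withoutMark(InputStream in) {
        return new FilterInputStream(in) {
            @Override
            public boolean markSupported() {
                return false;
            }
        };
    }

    public static void main(String[] args) {
        byte[] data = "%PDF-1.4 example".getBytes(StandardCharsets.US_ASCII);

        InputStream raw = withoutMark(new ByteArrayInputStream(data));
        System.out.println(sniff(raw)); // prints application/pdf
        System.out.println(sniff(raw)); // prints application/octet-stream

        // Wrapping once in a BufferedInputStream gives the detector something to rewind.
        InputStream buffered = new BufferedInputStream(withoutMark(new ByteArrayInputStream(data)));
        System.out.println(sniff(buffered)); // prints application/pdf
        System.out.println(sniff(buffered)); // prints application/pdf
    }
}
```

If the same cause applies to the reported case, wrapping the content stream once in a BufferedInputStream and passing that same wrapper to every detect call would be the thing to try.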
RE: [jira] [Commented] (TIKA-1535) Inheritance modification for the class MIMETypes
Thanks professor for the prompt and kind response, will keep you updated on the progress and findings.
[jira] [Updated] (TIKA-1423) Build a parser to extract data from GRIB formats
[ https://issues.apache.org/jira/browse/TIKA-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated TIKA-1423:
---
Attachment: TIKA-1423v2.patch

Patch for trunk which passes all tests including issues experienced with bundle module. Some investigative work was required here as well as publishing [Unidata dependencies|http://search.maven.org/#search|ga|1|ucar] to Maven central and updating our [https://wiki.apache.org/tika/ThirdPartySonaType|wiki documentation].
[jira] [Created] (TIKA-1534) Upgrade to Commons Compress 1.9
Tim Allison created TIKA-1534:
-

Summary: Upgrade to Commons Compress 1.9
Key: TIKA-1534
URL: https://issues.apache.org/jira/browse/TIKA-1534
Project: Tika
Issue Type: Improvement
Reporter: Tim Allison
Priority: Minor

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1534) Upgrade to Commons Compress 1.9
[ https://issues.apache.org/jira/browse/TIKA-1534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295749#comment-14295749 ]

Hudson commented on TIKA-1534:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #457 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/457/])
TIKA-1534: Upgrade to Commons Compress 1.9 (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1655433)
* /tika/trunk/CHANGES.txt
* /tika/trunk/tika-parsers/pom.xml
[jira] [Reopened] (TIKA-1329) Add RecursiveParserWrapper aka Jukka's (and Nick's) RecursiveMetadataParser
[ https://issues.apache.org/jira/browse/TIKA-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison reopened TIKA-1329:
---

Wait, do I need to update the webpage, too? Or is that done automatically from tika-examples?

Add RecursiveParserWrapper aka Jukka's (and Nick's) RecursiveMetadataParser
---

Key: TIKA-1329
URL: https://issues.apache.org/jira/browse/TIKA-1329
Project: Tika
Issue Type: Sub-task
Components: parser
Reporter: Tim Allison
Priority: Minor
Fix For: 1.8
Attachments: TIKA-1329v2.patch, test_recursive_embedded.docx

Jukka and Nick have a great demo of parsing metadata recursively on the [wiki|http://wiki.apache.org/tika/RecursiveMetadata]. For TIKA-1302, I'd like to use something similar, and I think that others may find it useful for tika-app and tika-server. I took the code from the wiki and made some modifications. I'm not sure if we should put this in parsers or in a new module for examples. Given that I think this would be useful for tika-app and tika-server, I'd prefer parsers, but I'm open to any input...including let's not.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)