[jira] [Reopened] (TIKA-2623) get embedded resources in PDF/doc files
[ https://issues.apache.org/jira/browse/TIKA-2623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ohad R reopened TIKA-2623:
--------------------------

Keep this "open" until https://github.com/apache/tika/pull/233 is merged.

> get embedded resources in PDF/doc files
> ---------------------------------------
>
>                 Key: TIKA-2623
>                 URL: https://issues.apache.org/jira/browse/TIKA-2623
>             Project: Tika
>          Issue Type: Improvement
>          Components: cli, core, parser
>            Reporter: Ohad R
>            Priority: Trivial
>             Fix For: 1.18
>
>
> The motivation: support embedded files in PDF, Word's doc/docx, etc.
> According to
> [https://stackoverflow.com/questions/20172465/get-embedded-resourses-in-doc-files-using-apache-tika],
> it is possible to recursively parse a document and save its sub-items (e.g.
> images) in a folder thanks to FileEmbeddedDocumentExtractor. However, that
> class is only available inside TikaCLI.
> I think it should be visible to applications that use Tika (not only to
> the CLI).

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Resolved] (TIKA-2623) get embedded resources in PDF/doc files
[ https://issues.apache.org/jira/browse/TIKA-2623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ohad R resolved TIKA-2623.
--------------------------
       Resolution: Fixed
    Fix Version/s: 1.18

[https://github.com/apache/tika/pull/233]
[jira] [Updated] (TIKA-2623) get embedded resources in PDF/doc files
[ https://issues.apache.org/jira/browse/TIKA-2623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ohad R updated TIKA-2623:
-------------------------
    Description:
The motivation: support embedded files in PDF, Word's doc/docx, etc.

According to [https://stackoverflow.com/questions/20172465/get-embedded-resourses-in-doc-files-using-apache-tika], it is possible to recursively parse a document and save its sub-items (e.g. images) in a folder thanks to FileEmbeddedDocumentExtractor. However, that class is only available inside TikaCLI.

I think it should be visible to applications that use Tika (not only to the CLI).

  was:
According to [https://stackoverflow.com/questions/20172465/get-embedded-resourses-in-doc-files-using-apache-tika], it is possible to recursively parse a document and save its sub-items (e.g. images) in a folder thanks to FileEmbeddedDocumentExtractor. However, that class is only available inside TikaCLI.

I think it should be visible to applications that use Tika (not only to the CLI).
[jira] [Updated] (TIKA-2623) get embedded resources in PDF/doc files
[ https://issues.apache.org/jira/browse/TIKA-2623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ohad R updated TIKA-2623:
-------------------------
    Summary: get embedded resources in PDF/doc files  (was: get embedded resources in doc files)
[jira] [Comment Edited] (TIKA-2623) get embedded resources in doc files
[ https://issues.apache.org/jira/browse/TIKA-2623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16425042#comment-16425042 ]

Ohad R edited comment on TIKA-2623 at 4/4/18 6:00 AM:
-----------------------------------------------------

"We don't want to add huge numbers of extra dependencies to Tika Core" - I agree, which is why I placed the file under tika-parsers, which already depends on commons-io and poi.

"As per the linked StackOverflow post, you can just write a few lines of Java yourself to do the saving in a similar way to the CLI, can you not just do that?" - I am not completely sure the code on Stack Overflow works for all cases: PDFs, Office docs, etc. If it is that simple, why do we need all the functionality of {{FileEmbeddedDocumentExtractor}} in the CLI, with all its dedicated peripheral code such as {{class OutputType}}?

In his answer, he said "*The best example I can think of for this is in the Tika CLI, as used by the {{-z}} (extract) flag. ...you're looking for the {{FileEmbeddedDocumentExtractor}} as your example.*" He then gives the simplest code, but that code seems partial and not as "production ready" as {{FileEmbeddedDocumentExtractor}}.

Please advise...
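
To make the "few lines of Java" question concrete, here is a minimal sketch of only the saving step, using nothing beyond the JDK. It is not the code from TikaCLI or from the Stack Overflow answer; the class name {{EmbeddedResourceSaver}} and its methods {{safeName}} and {{save}} are hypothetical names chosen for illustration. The assumption is that an application would call something like this from its own {{EmbeddedDocumentExtractor.parseEmbedded(...)}} implementation registered on the {{ParseContext}}, passing the resource name found in the embedded document's metadata. Even this small part has to handle missing names and names that contain path separators, which is some of the peripheral work the comment above refers to.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/**
 * Hypothetical helper (not part of Tika): writes one embedded resource
 * to a file inside a fixed output directory.
 */
final class EmbeddedResourceSaver {
    private final Path outputDir;
    private int count = 0; // number of resources saved so far

    EmbeddedResourceSaver(Path outputDir) {
        this.outputDir = outputDir;
    }

    /**
     * Reduces a resource name to a bare file name so it cannot escape the
     * output directory; falls back to "file" + index when nothing usable remains.
     */
    static String safeName(String resourceName, int index) {
        String name = resourceName == null ? "" : resourceName.trim();
        // Drop any directory part, whichever separator the document used.
        int cut = Math.max(name.lastIndexOf('/'), name.lastIndexOf('\\'));
        name = name.substring(cut + 1);
        if (name.isEmpty() || name.equals(".") || name.equals("..")) {
            return "file" + index;
        }
        return name;
    }

    /** Copies the stream into the output directory and returns the file written. */
    Path save(InputStream stream, String resourceName) throws IOException {
        Files.createDirectories(outputDir);
        Path target = outputDir.resolve(safeName(resourceName, ++count));
        // An existing file with the same name is overwritten in this sketch.
        Files.copy(stream, target, StandardCopyOption.REPLACE_EXISTING);
        return target;
    }
}
```

The sketch deliberately leaves out what a fuller extractor would still need: guessing a file extension from the detected content type, avoiding name collisions instead of overwriting, and recursing into containers nested inside the embedded resource.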
[jira] [Comment Edited] (TIKA-2623) get embedded resources in doc files
[ https://issues.apache.org/jira/browse/TIKA-2623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16425042#comment-16425042 ]

Ohad R edited comment on TIKA-2623 at 4/4/18 5:59 AM:
-----------------------------------------------------

"We don't want to add huge numbers of extra dependencies to Tika Core" - I agree, which is why I placed the file under tika-parsers, which already depends on commons-io and poi.

"As per the linked StackOverflow post, you can just write a few lines of Java yourself to do the saving in a similar way to the CLI, can you not just do that?" - I am not completely sure the code on Stack Overflow works for all cases: PDFs, Office docs, etc. If it is that simple, why do we need all the functionality of {{FileEmbeddedDocumentExtractor}} in the CLI, with all its dedicated peripheral code such as {{class OutputType}}?

In his answer, he said "*The best example I can think of for this is in the Tika CLI, as used by the {{-z}} (extract) flag. ...you're looking for the {{FileEmbeddedDocumentExtractor}} as your example.*" He then gives the simplest code, but that code seems partial and not as "production ready" as {{FileEmbeddedDocumentExtractor}}.

Please advise...
[jira] [Comment Edited] (TIKA-2623) get embedded resources in doc files
[ https://issues.apache.org/jira/browse/TIKA-2623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16425042#comment-16425042 ]

Ohad R edited comment on TIKA-2623 at 4/4/18 5:37 AM:
-----------------------------------------------------

"We don't want to add huge numbers of extra dependencies to Tika Core" - I agree, which is why I placed the file under tika-parsers, which already depends on commons-io and poi.

"As per the linked StackOverflow post, you can just write a few lines of Java yourself to do the saving in a similar way to the CLI, can you not just do that?" - I am not completely sure the code on Stack Overflow works for all cases: PDFs, Office docs, etc. If it is that simple, why do we need all the functionality of {{FileEmbeddedDocumentExtractor}} in the CLI, with all its dedicated peripheral code such as {{class OutputType}}?

In his answer, he said "*The best example I can think of for this is in the Tika CLI, as used by the {{-z}} (extract) flag. If you look in the [source code for TikaCLI|http://svn.apache.org/repos/asf/tika/trunk/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java], you're looking for the {{FileEmbeddedDocumentExtractor}} as your example.*" He then gives the simplest code, but that code seems partial and not as "production ready" as {{FileEmbeddedDocumentExtractor}}.

Please advise...
[jira] [Commented] (TIKA-2623) get embedded resources in doc files
[ https://issues.apache.org/jira/browse/TIKA-2623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16425042#comment-16425042 ]

Ohad R commented on TIKA-2623:
------------------------------

"We don't want to add huge numbers of extra dependencies to Tika Core" - I agree, which is why I placed the file under tika-parsers, which already depends on commons-io and poi.

"As per the linked StackOverflow post, you can just write a few lines of Java yourself to do the saving in a similar way to the CLI, can you not just do that?" - I am not completely sure the code on Stack Overflow works for all cases: PDFs, Office docs, etc. If it is that simple, why do we need all the functionality of 'FileEmbeddedDocumentExtractor' in the CLI, with all its dedicated peripheral code such as {{class OutputType}}?

In his answer, he said "*The best example I can think of for this is in the Tika CLI, as used by the {{-z}} (extract) flag. If you look in the [source code for TikaCLI|http://svn.apache.org/repos/asf/tika/trunk/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java], you're looking for the {{FileEmbeddedDocumentExtractor}} as your example.*" He then gives the simplest code, but that code seems partial and not as "production ready" as {{FileEmbeddedDocumentExtractor}}.

Please advise...
[jira] [Comment Edited] (TIKA-2623) get embedded resources in doc files
[ https://issues.apache.org/jira/browse/TIKA-2623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16423900#comment-16423900 ]

Ohad R edited comment on TIKA-2623 at 4/3/18 11:49 AM:
------------------------------------------------------

I have refactored {{FileEmbeddedDocumentExtractor}}: I moved it from tika-cli to tika-parsers. If we want it to be in tika-core (like {{ParsingEmbeddedDocumentExtractor}}), then the pom.xml needs to change: it would need dependencies on commons-io, poi, etc.

[https://github.com/OhadR/tika/commit/6e502f1bdc982bc4aa612efbb2450cfe6ca46fe1]

Could anyone have a look and let me know whether I should create a pull request for this?
[jira] [Updated] (TIKA-2623) get embedded resources in doc files
[ https://issues.apache.org/jira/browse/TIKA-2623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ohad R updated TIKA-2623:
-------------------------
    Component/s: cli
[jira] [Commented] (TIKA-2623) get embedded resources in doc files
[ https://issues.apache.org/jira/browse/TIKA-2623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16423900#comment-16423900 ]

Ohad R commented on TIKA-2623:
------------------------------

I have refactored {{FileEmbeddedDocumentExtractor}}: I moved it from tika-cli to tika-parsers. If we want it to be in tika-core (like {{ParsingEmbeddedDocumentExtractor}}), then the pom.xml needs to change: it would need dependencies on commons-io, poi, etc.
[jira] [Created] (TIKA-2623) get embedded resources in doc files
Ohad R created TIKA-2623:
----------------------------

             Summary: get embedded resources in doc files
                 Key: TIKA-2623
                 URL: https://issues.apache.org/jira/browse/TIKA-2623
             Project: Tika
          Issue Type: Improvement
          Components: core, parser
            Reporter: Ohad R


According to [https://stackoverflow.com/questions/20172465/get-embedded-resourses-in-doc-files-using-apache-tika], it is possible to recursively parse a document and save its sub-items (e.g. images) in a folder thanks to FileEmbeddedDocumentExtractor. However, that class is only available inside TikaCLI.

I think it should be visible to applications that use Tika (not only to the CLI).