[jira] [Reopened] (TIKA-2623) get embedded resources in PDF/doc files

2018-04-09 Thread Ohad R (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ohad R reopened TIKA-2623:
--

Keeping this open until the pull request is merged:

https://github.com/apache/tika/pull/233

> get embedded resources in PDF/doc files
> ---
>
> Key: TIKA-2623
> URL: https://issues.apache.org/jira/browse/TIKA-2623
> Project: Tika
> Issue Type: Improvement
> Components: cli, core, parser
> Reporter: Ohad R
> Priority: Trivial
> Fix For: 1.18
>
>
> The motivation: support embedded files in PDF, Word doc/docx, etc.
> According to 
> [https://stackoverflow.com/questions/20172465/get-embedded-resourses-in-doc-files-using-apache-tika],
> it is possible to recursively parse a document and save its sub-items (e.g. 
> images) to a folder using FileEmbeddedDocumentExtractor. However, that class 
> is only available inside TikaCLI.
> I think it should be visible to applications that use Tika, not only to 
> the CLI.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (TIKA-2623) get embedded resources in PDF/doc files

2018-04-09 Thread Ohad R (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ohad R resolved TIKA-2623.
--
   Resolution: Fixed
Fix Version/s: 1.18

[https://github.com/apache/tika/pull/233]

 



[jira] [Updated] (TIKA-2623) get embedded resources in PDF/doc files

2018-04-07 Thread Ohad R (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ohad R updated TIKA-2623:
-
Description: 
The motivation: support embedded files in PDF, Word doc/docx, etc.

According to 
[https://stackoverflow.com/questions/20172465/get-embedded-resourses-in-doc-files-using-apache-tika],
it is possible to recursively parse a document and save its sub-items (e.g. 
images) to a folder using FileEmbeddedDocumentExtractor. However, that class 
is only available inside TikaCLI.

I think it should be visible to applications that use Tika, not only to 
the CLI.

  was:
According to 
[https://stackoverflow.com/questions/20172465/get-embedded-resourses-in-doc-files-using-apache-tika],
it is possible to recursively parse a document and save its sub-items (e.g. 
images) to a folder using FileEmbeddedDocumentExtractor. However, that class 
is only available inside TikaCLI.

I think it should be visible to applications that use Tika, not only to 
the CLI.




[jira] [Updated] (TIKA-2623) get embedded resources in PDF/doc files

2018-04-07 Thread Ohad R (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ohad R updated TIKA-2623:
-
Summary: get embedded resources in PDF/doc files  (was: get embedded 
resources in doc files)



[jira] [Comment Edited] (TIKA-2623) get embedded resources in doc files

2018-04-04 Thread Ohad R (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16425042#comment-16425042
 ] 

Ohad R edited comment on TIKA-2623 at 4/4/18 6:00 AM:
--

"We don't want to add huge numbers of extra dependencies to Tika Core" - I 
agree, which is why I placed the file under tika-parsers, which already 
depends on Apache Commons IO and POI.

"As per the linked StackOverflow post, you can just write a few lines of Java 
yourself to do the saving in a similar way to the CLI, can you not just do 
that?" - I am not sure the code in the Stack Overflow answer works for all 
cases (PDFs, Office docs, etc.). If it is that simple, why does the CLI need 
all the functionality of {{FileEmbeddedDocumentExtractor}}, along with its 
dedicated peripheral code such as {{class OutputType}}?

In his answer, he said "*The best example I can think of for this is in the 
Tika CLI, as used by the {{-z}} (extract) flag. ...you're looking for the 
{{FileEmbeddedDocumentExtractor}} as your example.*" He then gives the simplest 
code, but that code seems partial and not as "production ready" as 
{{FileEmbeddedDocumentExtractor}}.

Please advise.
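
To illustrate the kind of peripheral handling in question, here is a minimal sketch of what saving an embedded resource to a folder may need beyond copying bytes: a fallback name when the resource has none, protection against names that point outside the output folder, and avoiding overwrites. It uses only the JDK. The class EmbeddedResourceSaver and its methods are hypothetical names made up for this illustration; this is not the code of FileEmbeddedDocumentExtractor, nor the code from the Stack Overflow answer.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper (not part of Tika): saves embedded resources into one folder. */
class EmbeddedResourceSaver {
    private final Path outputDir;
    private int unnamedCount = 0;

    EmbeddedResourceSaver(Path outputDir) {
        this.outputDir = outputDir.toAbsolutePath().normalize();
    }

    /** Picks a target file inside outputDir for the given (possibly missing) name. */
    Path resolveTarget(String resourceName) {
        String name = (resourceName == null) ? "" : resourceName.trim();
        // Keep only the last path segment so "../x" or "C:\\dir\\x" cannot escape outputDir.
        name = name.replace('\\', '/');
        name = name.substring(name.lastIndexOf('/') + 1);
        if (name.isEmpty() || name.equals(".") || name.equals("..")) {
            // Embedded resources do not always carry a usable name.
            name = "file" + (unnamedCount++);
        }
        Path target = outputDir.resolve(name);
        // Do not overwrite a resource that was saved earlier under the same name.
        for (int i = 1; Files.exists(target); i++) {
            target = outputDir.resolve(i + "-" + name);
        }
        return target;
    }

    /** Copies the resource bytes to the chosen target and returns its path. */
    Path save(String resourceName, InputStream content) throws IOException {
        Files.createDirectories(outputDir);
        Path target = resolveTarget(resourceName);
        Files.copy(content, target);
        return target;
    }
}
```

In an application, an extractor callback would call save(...) once per embedded resource, passing whatever name the parser reports (or null when there is none). Type detection for missing file extensions and container-specific handling are deliberately left out of this sketch.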

 




[jira] [Comment Edited] (TIKA-2623) get embedded resources in doc files

2018-04-04 Thread Ohad R (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16425042#comment-16425042
 ] 

Ohad R edited comment on TIKA-2623 at 4/4/18 5:59 AM:
--

"We don't want to add huge numbers of extra dependencies to Tika Core" - I 
agree, which is why I placed the file under tika-parsers, which already 
depends on Apache Commons IO and POI.

"As per the linked StackOverflow post, you can just write a few lines of Java 
yourself to do the saving in a similar way to the CLI, can you not just do 
that?" - I am not sure the code in the Stack Overflow answer works for all 
cases (PDFs, Office docs, etc.). If it is that simple, why does the CLI need 
all the functionality of {{FileEmbeddedDocumentExtractor}}, along with its 
dedicated peripheral code such as {{class OutputType}}?

In his answer, he said "*The best example I can think of for this is in the 
Tika CLI, as used by the {{-z}} (extract) flag. ...you're looking for the 
{{FileEmbeddedDocumentExtractor}} as your example.*" He then gives the simplest 
code, but that code seems partial and not as "production ready" as 
{{FileEmbeddedDocumentExtractor}}.

Please advise.

 




[jira] [Comment Edited] (TIKA-2623) get embedded resources in doc files

2018-04-03 Thread Ohad R (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16425042#comment-16425042
 ] 

Ohad R edited comment on TIKA-2623 at 4/4/18 5:37 AM:
--

"We don't want to add huge numbers of extra dependencies to Tika Core" - I 
agree, which is why I placed the file under tika-parsers, which already 
depends on Apache Commons IO and POI.

"As per the linked StackOverflow post, you can just write a few lines of Java 
yourself to do the saving in a similar way to the CLI, can you not just do 
that?" - I am not sure the code in the Stack Overflow answer works for all 
cases (PDFs, Office docs, etc.). If it is that simple, why does the CLI need 
all the functionality of {{FileEmbeddedDocumentExtractor}}, along with its 
dedicated peripheral code such as {{class OutputType}}?

In his answer, he said "*The best example I can think of for this is in the 
Tika CLI, as used by the {{-z}} (extract) flag. If you look in the [source code 
for 
TikaCLI|http://svn.apache.org/repos/asf/tika/trunk/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java],
 you're looking for the {{FileEmbeddedDocumentExtractor}} as your example.*" 
He then gives the simplest code, but that code seems partial and not as 
"production ready" as {{FileEmbeddedDocumentExtractor}}.

Please advise.

 




[jira] [Commented] (TIKA-2623) get embedded resources in doc files

2018-04-03 Thread Ohad R (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16425042#comment-16425042
 ] 

Ohad R commented on TIKA-2623:
--

"We don't want to add huge numbers of extra dependencies to Tika Core" - I 
agree, which is why I placed the file under tika-parsers, which already 
depends on Apache Commons IO and POI.

"As per the linked StackOverflow post, you can just write a few lines of Java 
yourself to do the saving in a similar way to the CLI, can you not just do 
that?" - I am not sure the code in the Stack Overflow answer works for all 
cases (PDFs, Office docs, etc.). If it is that simple, why does the CLI need 
all the functionality of 'FileEmbeddedDocumentExtractor', along with its 
dedicated peripheral code such as {{class OutputType}}?

In his answer, he said "*The best example I can think of for this is in the 
Tika CLI, as used by the {{-z}} (extract) flag. If you look in the [source code 
for 
TikaCLI|http://svn.apache.org/repos/asf/tika/trunk/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java],
 you're looking for the {{FileEmbeddedDocumentExtractor}} as your example.*" 
He then gives the simplest code, but that code seems partial and not as 
"production ready" as {{FileEmbeddedDocumentExtractor}}.

Please advise.

 



[jira] [Comment Edited] (TIKA-2623) get embedded resources in doc files

2018-04-03 Thread Ohad R (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16423900#comment-16423900
 ] 

Ohad R edited comment on TIKA-2623 at 4/3/18 11:49 AM:
---

I have refactored `FileEmbeddedDocumentExtractor`: I moved it from the CLI 
(tika-app) to tika-parsers. 

If we want it in tika-core (like `ParsingEmbeddedDocumentExtractor`), then the 
pom.xml needs to change: it would need dependencies on Apache Commons IO, POI, 
etc.

[https://github.com/OhadR/tika/commit/6e502f1bdc982bc4aa612efbb2450cfe6ca46fe1]

Could someone take a look and let me know whether I should create a pull 
request for this?




[jira] [Updated] (TIKA-2623) get embedded resources in doc files

2018-04-03 Thread Ohad R (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ohad R updated TIKA-2623:
-
Component/s: cli



[jira] [Commented] (TIKA-2623) get embedded resources in doc files

2018-04-03 Thread Ohad R (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16423900#comment-16423900
 ] 

Ohad R commented on TIKA-2623:
--

I have refactored `FileEmbeddedDocumentExtractor`: I moved it from the CLI 
(tika-app) to tika-parsers. 

If we want it in tika-core (like `ParsingEmbeddedDocumentExtractor`), then the 
pom.xml needs to change: it would need dependencies on Apache Commons IO, POI, 
etc.



[jira] [Created] (TIKA-2623) get embedded resources in doc files

2018-04-02 Thread Ohad R (JIRA)
Ohad R created TIKA-2623:


 Summary: get embedded resources in doc files
 Key: TIKA-2623
 URL: https://issues.apache.org/jira/browse/TIKA-2623
 Project: Tika
 Issue Type: Improvement
 Components: core, parser
 Reporter: Ohad R


According to 
[https://stackoverflow.com/questions/20172465/get-embedded-resourses-in-doc-files-using-apache-tika],
it is possible to recursively parse a document and save its sub-items (e.g. 
images) to a folder using FileEmbeddedDocumentExtractor. However, that class 
is only available inside TikaCLI.

I think it should be visible to applications that use Tika, not only to 
the CLI.


