[jira] [Commented] (TIKA-2262) Supporting Image-to-Text (Image Captioning) in Tika for Image MIME Types

2017-07-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16079427#comment-16079427
 ] 

ASF GitHub Bot commented on TIKA-2262:
--

chrismattmann commented on a change in pull request #189: Fix for TIKA-2262: 
Supporting Image-to-Text (Image Captioning) in Tika
URL: https://github.com/apache/tika/pull/189#discussion_r126294778
 
 

 ##
 File path: 
tika-parsers/src/main/resources/org/apache/tika/parser/captioning/tf/Im2txtRestDockerfile
 ##
 @@ -0,0 +1,62 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+FROM inception-rest-tika
+MAINTAINER Apache Tika Team
+
+# Install Python Pillow. TODO: remove this once TIKA-2398 is fixed
+RUN pip install pillow
+
+# Download the pretrained im2txt checkpoint
+WORKDIR /usr/share/apache-tika/models/dl/image/caption/
+
+RUN \
+wget https://www.dropbox.com/s/l9ignjpjk774n2z/model_main_untuned.zip?dl=0 
\
 
 Review comment:
   Can we please check the model in to 
https://github.com/USCDataScience/img2text.git? I can create the repo for you. 
Please also check in any scripts you used to generate the model; then we can 
check in the model zip file.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Supporting Image-to-Text (Image Captioning) in Tika for Image MIME Types
> 
>
> Key: TIKA-2262
> URL: https://issues.apache.org/jira/browse/TIKA-2262
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Thamme Gowda
>Assignee: Thamme Gowda
>  Labels: deeplearning, gsoc2017, machine_learning
>
> h2. Background:
> An image caption is a short piece of text, usually one line, added to the 
> metadata of an image to provide a brief summary of the scene it shows. 
> Generating captions is a challenging and interesting problem in computer vision. 
> Tika already supports image recognition via the [Object Recognition 
> Parser, TIKA-1993| https://issues.apache.org/jira/browse/TIKA-1993], which 
> uses an InceptionV3 model pre-trained on the ImageNet dataset using TensorFlow. 
> Captioning an image is a very useful feature since it helps text-based 
> Information Retrieval (IR) systems to "understand" the scene in images.
> h2. Technical details and references:
> * Google open sourced its 'show and tell' neural network and its model 
> for automatically generating captions some time ago. [Source Code| 
> https://github.com/tensorflow/models/tree/master/im2txt], [Research blog| 
> https://research.googleblog.com/2016/09/show-and-tell-image-captioning-open.html]
> * Integrate it the same way as the ObjectRecognitionParser
> ** Create a RESTful API Service [similar to this| 
> https://wiki.apache.org/tika/TikaAndVision#A2._Tensorflow_Using_REST_Server] 
> ** Extend or enhance ObjectRecognitionParser or one of its implementations
> h2. {skills, learning, homework} for GSoC students
> * Knowledge of languages: Java AND Python, and the Maven build system
> * RESTful APIs 
> * TensorFlow/Keras
> * deep learning
> 
> Alternatively, a somewhat harder path for the experienced: 
> [Import the Keras/TensorFlow model into 
> deeplearning4j|https://deeplearning4j.org/model-import-keras ] and run it 
> natively inside the JVM.
> h4. Benefits
> * no RESTful integration required, and thus no external dependencies
> * easy to distribute on Hadoop/Spark clusters
> h4. Hurdles:
> * This is a work-in-progress feature in deeplearning4j, so expect plenty 
> of trouble along the way! 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (TIKA-2262) Supporting Image-to-Text (Image Captioning) in Tika for Image MIME Types

2017-07-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16079359#comment-16079359
 ] 

ASF GitHub Bot commented on TIKA-2262:
--

chrismattmann closed pull request #189: Fix for TIKA-2262: Supporting 
Image-to-Text (Image Captioning) in Tika
URL: https://github.com/apache/tika/pull/189
 
 
   
 



[jira] [Commented] (TIKA-2262) Supporting Image-to-Text (Image Captioning) in Tika for Image MIME Types

2017-07-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16079360#comment-16079360
 ] 

ASF GitHub Bot commented on TIKA-2262:
--

chrismattmann commented on issue #189: Fix for TIKA-2262: Supporting 
Image-to-Text (Image Captioning) in Tika
URL: https://github.com/apache/tika/pull/189#issuecomment-313886335
 
 
   Merged into the branch. I am going to test this right now. Let's keep 
working on the branch.
 



Re: [VOTE] Release Apache Tika 1.16 Candidate #1

2017-07-08 Thread Chris Mattmann
+1 from me. SIGS and CHECKSUMS look good.

Thanks Tim!

Cheers,
Chris

LMC-053601:apache-tika-1.16-rc1 mattmann$ for type in "" \-app \-eval \-server; 
do $HOME/bin/stage_apache_rc tika$type 1.16 
https://dist.apache.org/repos/dist/dev/tika/; done
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 53.5M  100 53.5M    0     0  3992k      0  0:00:13  0:00:13 --:--:-- 5122k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   836  100   836    0     0   1092      0 --:--:-- --:--:-- --:--:--  1092
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    34  100    34    0     0     96      0 --:--:-- --:--:-- --:--:--    96
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 41.6M  100 41.6M    0     0  6578k      0  0:00:06  0:00:06 --:--:-- 8297k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   836  100   836    0     0   1012      0 --:--:-- --:--:-- --:--:--  1012
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    34  100    34    0     0     46      0 --:--:-- --:--:-- --:--:--    46
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 56.4M  100 56.4M    0     0  3950k      0  0:00:14  0:00:14 --:--:-- 4742k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   836  100   836    0     0   1470      0 --:--:-- --:--:-- --:--:--  1469
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    34  100    34    0     0     65      0 --:--:-- --:--:-- --:--:--    65
LMC-053601:apache-tika-1.16-rc1 mattmann$ $HOME/bin/stage_apache_rc tika 
1.16-src https://dist.apache.org/repos/dist/dev/tika/
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 84.2M  100 84.2M    0     0  6563k      0  0:00:13  0:00:13 --:--:-- 5261k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   836  100   836    0     0   2129      0 --:--:-- --:--:-- --:--:--  2127
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    34  100    34    0     0     47      0 --:--:-- --:--:-- --:--:--    47
LMC-053601:apache-tika-1.16-rc1 mattmann$ ls
tika-1.16-src.zip       tika-app-1.16.jar       tika-eval-1.16.jar      tika-server-1.16.jar
tika-1.16-src.zip.asc   tika-app-1.16.jar.asc   tika-eval-1.16.jar.asc  tika-server-1.16.jar.asc
tika-1.16-src.zip.md5   tika-app-1.16.jar.md5   tika-eval-1.16.jar.md5  tika-server-1.16.jar.md5
LMC-053601:apache-tika-1.16-rc1 mattmann$ $HOME/bin/verify_gpg_sigs
Verifying Signature for file tika-1.16-src.zip.asc
gpg: assuming signed data in `tika-1.16-src.zip'
gpg: Signature made Fri Jul  7 19:27:42 2017 PDT using RSA key ID EF0CF38A
gpg: Good signature from "Tim Allison (ASF signing key) "
gpg: WARNING: This key is not certified with a trusted signature!
gpg:  There is no indication that the signature belongs to the owner.
Primary key fingerprint: 833C 1CC4 926C 1DDE 29BB  8731 E403 2DC4 EF0C F38A
Verifying Signature for file tika-app-1.16.jar.asc
gpg: assuming signed data in `tika-app-1.16.jar'
gpg: Signature made Fri Jul  7 19:13:16 2017 PDT using RSA key ID EF0CF38A
gpg: Good signature from "Tim Allison (ASF signing key) "
gpg: WARNING: This key is not certified with a trusted signature!
gpg:  There is no indication that the signature belongs to the owner.
Primary key fingerprint: 833C 1CC4 926C 1DDE 29BB  8731 E403 2DC4 EF0C F38A
Verifying Signature for file tika-eval-1.16.jar.asc
gpg: assuming signed data in `tika-eval-1.16.jar'
gpg: Signature made Fri Jul  7 19:20:17 2017 PDT using RSA key ID EF0CF38A
gpg: Good signature from "Tim Allison (ASF signing key) "
gpg: WARNING: This key is not certified with a trusted signature!
gpg:  There is no 
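
The archived message above is cut off partway through the signature output, and the checksum check itself is not shown. Purely as an illustration, and assuming each `.md5` file holds the hex digest as its first whitespace-separated token, a comparable check could be sketched in Java as follows. The class and method names are hypothetical; this is not the `stage_apache_rc` or `verify_gpg_sigs` script used in the message.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

// Hypothetical helper, not part of Tika or the ASF release tooling.
final class Md5Check {

    // Returns the lowercase hex MD5 digest of the given bytes.
    static String md5Hex(byte[] data) {
        try {
            MessageDigest digest = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    // Assumption: the first whitespace-separated token of the .md5 file is the digest.
    static boolean matches(Path artifact, Path md5File) throws IOException {
        String recorded = new String(Files.readAllBytes(md5File), StandardCharsets.US_ASCII)
                .trim().split("\\s+")[0].toLowerCase(Locale.ROOT);
        // Reads the whole artifact into memory; acceptable for a sketch.
        return recorded.equals(md5Hex(Files.readAllBytes(artifact)));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: java Md5Check <artifact>");
            return;
        }
        Path artifact = Paths.get(args[0]);
        Path md5File = Paths.get(args[0] + ".md5");
        System.out.println(artifact + ": " + (matches(artifact, md5File) ? "OK" : "MISMATCH"));
    }
}
```

Run against, for example, `tika-app-1.16.jar`, this would look for `tika-app-1.16.jar.md5` next to it. An MD5 match only guards against corrupted downloads; the GPG signature check shown in the message is what ties the artifact to the release manager's key.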

[jira] [Commented] (TIKA-1367) Tika documentation should list tika-parsers parser dependencies

2017-07-08 Thread Gus Heck (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16079164#comment-16079164
 ] 

Gus Heck commented on TIKA-1367:


Update: I did start making test cases, but this may have been a matter of my 
Artifactory instance behaving strangely. I'm seeing different results when 
using mavenCentral() directly. Artifactory was supposedly proxying Central 
transparently, but perhaps not quite. 

> Tika documentation should list tika-parsers parser dependencies
> ---
>
> Key: TIKA-1367
> URL: https://issues.apache.org/jira/browse/TIKA-1367
> Project: Tika
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Sergey Beryozkin
> Fix For: 1.17
>
>
> The tika-parsers module has many strong transitive parser dependencies. Maven 
> users of tika-parsers have to exclude all the transitive dependencies 
> manually. Documenting the list of existing transitive dependencies and 
> keeping it up to date will help developers exclude the libraries not 
> needed for a given project.





[jira] [Commented] (TIKA-2262) Supporting Image-to-Text (Image Captioning) in Tika for Image MIME Types

2017-07-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16079096#comment-16079096
 ] 

ASF GitHub Bot commented on TIKA-2262:
--

ThejanW commented on issue #189: Fix for TIKA-2262: Supporting Image-to-Text 
(Image Captioning) in Tika
URL: https://github.com/apache/tika/pull/189#issuecomment-313850196
 
 
   @thammegowda @chrismattmann please merge my commits so I can proceed.
 



[jira] [Commented] (TIKA-2262) Supporting Image-to-Text (Image Captioning) in Tika for Image MIME Types

2017-07-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16079095#comment-16079095
 ] 

ASF GitHub Bot commented on TIKA-2262:
--

ThejanW commented on issue #189: Fix for TIKA-2262: Supporting Image-to-Text 
(Image Captioning) in Tika
URL: https://github.com/apache/tika/pull/189#issuecomment-313850148
 
 
   @tballison I was able to fix the Maven import errors, but then hit this error: 
`TensorflowRESTCaptioner is not abstract and does not override abstract method 
checkInitialization(org.apache.tika.config.InitializableProblemHandler) in 
org.apache.tika.config.Initializable`
   
   I had to add this to TensorflowRESTCaptioner to get rid of the error:
   `@Override
   public void checkInitialization(InitializableProblemHandler handler) 
throws TikaConfigException {
       // intentionally empty: added only to satisfy the Initializable interface
   }`
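
   To illustrate why the compiler complained and what a non-empty override might do, here is a self-contained sketch. The three types declared at the top are simplified stand-ins for the Tika types named in the error, not the real classes, and `CaptionerSketch`, its `serviceAvailable` flag, and `collectProblems` are hypothetical names invented for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins (assumptions) for the Tika types named in the compiler error.
class TikaConfigException extends Exception {
    TikaConfigException(String message) {
        super(message);
    }
}

interface InitializableProblemHandler {
    void handleInitializableProblem(String className, String message) throws TikaConfigException;
}

interface Initializable {
    void checkInitialization(InitializableProblemHandler handler) throws TikaConfigException;
}

// A concrete class must implement every abstract method of an interface it
// declares; otherwise javac reports "is not abstract and does not override".
class CaptionerSketch implements Initializable {
    private final boolean serviceAvailable; // hypothetical health flag

    CaptionerSketch(boolean serviceAvailable) {
        this.serviceAvailable = serviceAvailable;
    }

    @Override
    public void checkInitialization(InitializableProblemHandler handler) throws TikaConfigException {
        // Instead of a no-op, report the problem and let the handler decide what to do.
        if (!serviceAvailable) {
            handler.handleInitializableProblem(getClass().getName(),
                    "captioning REST service is not reachable");
        }
    }

    // Runs the check with a handler that records messages rather than throwing.
    static List<String> collectProblems(boolean serviceAvailable) {
        List<String> problems = new ArrayList<>();
        try {
            new CaptionerSketch(serviceAvailable)
                    .checkInitialization((className, message) -> problems.add(message));
        } catch (TikaConfigException e) {
            problems.add(e.getMessage());
        }
        return problems;
    }

    public static void main(String[] args) {
        System.out.println("available:   " + collectProblems(true));
        System.out.println("unavailable: " + collectProblems(false));
    }
}
```

   The empty override quoted above is the smallest change that compiles. Whether the captioner should report anything from this method is a design decision for the pull request, not something this sketch settles.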
 
