[
https://issues.apache.org/jira/browse/TIKA-2672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537286#comment-16537286
]
ASF GitHub Bot commented on TIKA-2672:
--------------------------------------
chrismattmann commented on issue #241: Fix for TIKA-2672
URL: https://github.com/apache/tika/pull/241#issuecomment-403553789
OK, tested VGG16, looks awesome, and works. FYI, I had to `rm -rf
$HOME/.tika-dl/` first, and folks may also want to `rm -rf $HOME/.deeplearning4j*`:
## VGG16 server outputs:
```
nonas:tika2.0.0 mattmann$ tika --config=tika-dl/src/test/resources/org/apache/tika/dl/imagerec/dl4j-vgg16-config.xml
Jul 09, 2018 10:12:35 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
Jul 09, 2018 10:12:35 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: Tesseract OCR is installed and will be automatically applied to image files unless
you've excluded the TesseractOCRParser from the default parser.
Tesseract may dramatically slow down content extraction (TIKA-2359).
As of Tika 1.15 (and prior versions), Tesseract is automatically called.
In future versions of Tika, users may need to turn the TesseractOCRParser on via TikaConfig.
Jul 09, 2018 10:12:35 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
INFO Starting Apache Tika 2.0.0-SNAPSHOT server
INFO Using custom config: tika-dl/src/test/resources/org/apache/tika/dl/imagerec/dl4j-vgg16-config.xml
INFO Loaded [CpuBackend] backend
INFO Number of threads used for NativeOps: 2
INFO Number of threads used for BLAS: 2
INFO Backend used: [CPU]; OS: [Mac OS X]
INFO Cores: [4]; Memory: [3.6GB];
INFO Blas vendor: [MKL]
WARN java.io.UTFDataFormatException: malformed input around byte 11
java.lang.RuntimeException: java.io.UTFDataFormatException: malformed input around byte 11
    at org.nd4j.linalg.api.buffer.BaseDataBuffer.read(BaseDataBuffer.java:1509)
    at org.nd4j.linalg.compression.CompressedDataBuffer.readUnknown(CompressedDataBuffer.java:83)
    at org.nd4j.linalg.factory.Nd4j.read(Nd4j.java:2725)
    at org.deeplearning4j.util.ModelSerializer.restoreComputationGraph(ModelSerializer.java:564)
    at org.deeplearning4j.util.ModelSerializer.restoreComputationGraph(ModelSerializer.java:476)
    at org.apache.tika.dl.imagerec.DL4JVGG16Net.initialize(DL4JVGG16Net.java:95)
    at org.apache.tika.parser.recognition.ObjectRecognitionParser.initialize(ObjectRecognitionParser.java:94)
    at org.apache.tika.config.TikaConfig$XmlLoader.loadOne(TikaConfig.java:644)
    at org.apache.tika.config.TikaConfig$XmlLoader.loadOverall(TikaConfig.java:554)
    at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:191)
    at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:172)
    at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:165)
    at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:129)
    at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:124)
    at org.apache.tika.server.TikaServerCli.main(TikaServerCli.java:156)
Caused by: java.io.UTFDataFormatException: malformed input around byte 11
    at java.io.DataInputStream.readUTF(DataInputStream.java:656)
    at java.io.DataInputStream.readUTF(DataInputStream.java:564)
    at org.nd4j.linalg.api.buffer.BaseDataBuffer.read(BaseDataBuffer.java:1450)
    ... 14 more
ERROR Can't start
org.apache.tika.exception.TikaConfigException: java.io.UTFDataFormatException: malformed input around byte 11
    at org.apache.tika.dl.imagerec.DL4JVGG16Net.initialize(DL4JVGG16Net.java:115)
    at org.apache.tika.parser.recognition.ObjectRecognitionParser.initialize(ObjectRecognitionParser.java:94)
    at org.apache.tika.config.TikaConfig$XmlLoader.loadOne(TikaConfig.java:644)
    at org.apache.tika.config.TikaConfig$XmlLoader.loadOverall(TikaConfig.java:554)
    at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:191)
    at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:172)
    at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:165)
    at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:129)
    at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:124)
    at org.apache.tika.server.TikaServerCli.main(TikaServerCli.java:156)
Caused by: java.lang.RuntimeException: java.io.UTFDataFormatException: malformed input around byte 11
    at org.nd4j.linalg.api.buffer.BaseDataBuffer.read(BaseDataBuffer.java:1509)
    at org.nd4j.linalg.compression.CompressedDataBuffer.readUnknown(CompressedDataBuffer.java:83)
    at org.nd4j.linalg.factory.Nd4j.read(Nd4j.java:2725)
    at org.deeplearning4j.util.ModelSerializer.restoreComputationGraph(ModelSerializer.java:564)
    at org.deeplearning4j.util.ModelSerializer.restoreComputationGraph(ModelSerializer.java:476)
    at org.apache.tika.dl.imagerec.DL4JVGG16Net.initialize(DL4JVGG16Net.java:95)
    ... 9 more
Caused by: java.io.UTFDataFormatException: malformed input around byte 11
    at java.io.DataInputStream.readUTF(DataInputStream.java:656)
    at java.io.DataInputStream.readUTF(DataInputStream.java:564)
    at org.nd4j.linalg.api.buffer.BaseDataBuffer.read(BaseDataBuffer.java:1450)
    ... 14 more
nonas:tika2.0.0 mattmann$ cat tika-dl/src/test/resources/org/apache/tika/dl/imagerec/dl4j-vgg16-config.xml
<?xml version="1.0" encoding="UTF-8"?>
<!--
~ Licensed to the Apache Software Foundation (ASF) under one or more
~ contributor license agreements. See the NOTICE file distributed with
~ this work for additional information regarding copyright ownership.
~ The ASF licenses this file to You under the Apache License, Version 2.0
~ (the "License"); you may not use this file except in compliance with
~ the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing, software
~ distributed under the License is distributed on an "AS IS" BASIS,
~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~ See the License for the specific language governing permissions and
~ limitations under the License.
-->
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.recognition.ObjectRecognitionParser">
      <mime>image/jpeg</mime>
      <params>
        <param name="topN" type="int">3</param>
        <param name="minConfidence" type="double">0.015</param>
        <param name="class" type="string">org.apache.tika.dl.imagerec.DL4JVGG16Net</param>
        <param name="modelType" type="string">VGG16</param>
        <param name="serialize" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
nonas:tika2.0.0 mattmann$ ls /Users/mattmann/.tika-dl/
models
nonas:tika2.0.0 mattmann$ rm -rf $HOME/.tika-dl/
nonas:tika2.0.0 mattmann$ tika --config=tika-dl/src/test/resources/org/apache/tika/dl/imagerec/dl4j-vgg16-config.xml
Jul 09, 2018 10:13:56 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
Jul 09, 2018 10:13:56 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: Tesseract OCR is installed and will be automatically applied to image files unless
you've excluded the TesseractOCRParser from the default parser.
Tesseract may dramatically slow down content extraction (TIKA-2359).
As of Tika 1.15 (and prior versions), Tesseract is automatically called.
In future versions of Tika, users may need to turn the TesseractOCRParser on via TikaConfig.
Jul 09, 2018 10:13:56 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
INFO Starting Apache Tika 2.0.0-SNAPSHOT server
INFO Using custom config: tika-dl/src/test/resources/org/apache/tika/dl/imagerec/dl4j-vgg16-config.xml
INFO Loaded [CpuBackend] backend
INFO Number of threads used for NativeOps: 2
INFO Number of threads used for BLAS: 2
INFO Backend used: [CPU]; OS: [Mac OS X]
INFO Cores: [4]; Memory: [3.6GB];
INFO Blas vendor: [MKL]
WARN Preprocessed Model doesn't exist at /Users/mattmann/.tika-dl/models/dl4j/vgg-16/vgg16.zip
INFO Using cached model at /Users/mattmann/.deeplearning4j/models/vgg16/vgg16_dl4j_inference.zip
INFO Verifying download...
INFO Checksum local is 3501732770, expecting 3501732770
INFO Starting ComputationGraph with WorkspaceModes set to [training: NONE; inference: SINGLE], cacheMode set to [NONE]
INFO Saving the Loaded model for future use. Saved models are more optimised to consume less resources.
INFO Recogniser = org.apache.tika.dl.imagerec.DL4JVGG16Net
INFO Recogniser Available = true
INFO Setting the server's publish address to be http://localhost:9998/
INFO jetty-8.y.z-SNAPSHOT
INFO Started SelectChannelConnector@localhost:9998
INFO Started Apache Tika server at http://localhost:9998/
INFO rmeta (autodetecting type)
INFO Time taken 1427ms
INFO Add RecognisedObject{label='lion' (eng), id='lion', confidence=0.9999885559082031}
INFO Add RecognisedObject{label='chow' (eng), id='chow', confidence=1.1340579476382118E-5}
INFO Add RecognisedObject{label='dhole' (eng), id='dhole', confidence=8.046561106311856E-8}
```
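For anyone who wants to script the cache cleanup mentioned at the top, here is a minimal sketch. It only assumes the two locations named in this comment (`$HOME/.tika-dl` and `$HOME/.deeplearning4j*`); the function names are made up for illustration and are not part of Tika or DL4J. It defaults to a dry run so nothing is deleted by accident.

```python
import glob
import os
import shutil


def stale_model_caches(home):
    """List the cache directories named in this comment that exist under `home`."""
    patterns = [".tika-dl", ".deeplearning4j*"]
    found = []
    for pattern in patterns:
        # glob.escape guards against glob metacharacters in the home path itself.
        found.extend(glob.glob(os.path.join(glob.escape(home), pattern)))
    return sorted(path for path in found if os.path.isdir(path))


def clear_model_caches(home, dry_run=True):
    """Remove the cache directories; with dry_run=True only report them."""
    targets = stale_model_caches(home)
    if not dry_run:
        for path in targets:
            shutil.rmtree(path)
    return targets


if __name__ == "__main__":
    # Dry run: prints what would be removed without touching anything.
    for path in clear_model_caches(os.path.expanduser("~")):
        print("would remove", path)
```

Pass `dry_run=False` once the listed paths look right. The models are downloaded or re-serialized again on the next server start, as the second run in the log above shows.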
## VGG16 client
```
nonas:imagerec mattmann$ curl -T lion.jpg http://localhost:9998/rmeta | python -mjson.tool
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 45971    0  1530  100 44441    916  26617  0:00:01  0:00:01 --:--:-- 26627
[
    {
        "Content-Type": "image/jpeg",
        "OBJECT": [
            "lion (0.99999)",
            "chow (0.00001)",
            "dhole (0.00000)"
        ],
        "X-Parsed-By": [
            "org.apache.tika.parser.CompositeParser",
            "org.apache.tika.parser.recognition.ObjectRecognitionParser"
        ],
        "X-TIKA:content": "<html xmlns=\"http://www.w3.org/1999/xhtml\">\n<head>\n<meta name=\"org.apache.tika.parser.recognition.object.rec.impl\" content=\"org.apache.tika.dl.imagerec.DL4JVGG16Net\" />\n<meta name=\"X-Parsed-By\" content=\"org.apache.tika.parser.CompositeParser\" />\n<meta name=\"X-Parsed-By\" content=\"org.apache.tika.parser.recognition.ObjectRecognitionParser\" />\n<meta name=\"OBJECT\" content=\"lion (0.99999)\" />\n<meta name=\"OBJECT\" content=\"chow (0.00001)\" />\n<meta name=\"OBJECT\" content=\"dhole (0.00000)\" />\n<meta name=\"Content-Type\" content=\"image/jpeg\" />\n<title></title>\n</head>\n<body><ol id=\"objects\">\t<li id=\"lion\"> lion [eng](confidence = 0.999989)</li>\n\t<li id=\"chow\"> chow [eng](confidence = 0.000011)</li>\n\t<li id=\"dhole\"> dhole [eng](confidence = 0.000000)</li>\n</ol>\n</body></html>",
        "X-TIKA:parse_time_millis": "1495",
        "org.apache.tika.parser.recognition.object.rec.impl": "org.apache.tika.dl.imagerec.DL4JVGG16Net"
    }
]
nonas:imagerec mattmann$
```
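If you want the labels and confidences as data rather than eyeballing the JSON, a rough sketch follows. It assumes the `OBJECT` values keep the `label (confidence)` shape shown above, and that a record with a single object might carry a plain string instead of a list (that second part is an assumption, not something this run shows). `SAMPLE` is trimmed from the response above and `parse_objects` is a name invented here.

```python
import json
import re

# Trimmed from the rmeta response above; only the fields used here are kept.
SAMPLE = """
[{"Content-Type": "image/jpeg",
  "OBJECT": ["lion (0.99999)", "chow (0.00001)", "dhole (0.00000)"]}]
"""

# Matches entries such as "lion (0.99999)".
_OBJECT_RE = re.compile(r"^(?P<label>.+) \((?P<conf>[0-9.]+)\)$")


def parse_objects(rmeta_json):
    """Return (label, confidence) pairs from every metadata record."""
    results = []
    for record in json.loads(rmeta_json):
        objects = record.get("OBJECT", [])
        if isinstance(objects, str):
            # Assumption: a single value may arrive as a bare string.
            objects = [objects]
        for entry in objects:
            match = _OBJECT_RE.match(entry)
            if match:
                results.append((match.group("label"), float(match.group("conf"))))
    return results


if __name__ == "__main__":
    for label, confidence in parse_objects(SAMPLE):
        print(f"{label}: {confidence}")
```

The same function could be fed the body returned by the `curl` call above instead of `SAMPLE`.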
> Upgrade dl4j to 1.0.0-beta
> --------------------------
>
> Key: TIKA-2672
> URL: https://issues.apache.org/jira/browse/TIKA-2672
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
> Attachments: TIKA-2672.patch
>
>
> Let's try to upgrade dl4j. I think I got us most of the way there, but I got
> this error when reading the json config file. Can someone with more
> knowledge of layer specs help ([~thammegowda], perhaps :))?
> {noformat}
> org.deeplearning4j.exception.DL4JInvalidConfigException: Invalid
> configuration for layer (idx=-1, name=convolution2d_2, type=ConvolutionLayer)
> for width dimension: Invalid input configuration for kernel width. Require 0
> < kW <= inWidth + 2*padW; got (kW=3, inWidth=1, padW=0)
> Input type = InputTypeConvolutional(h=149,w=1,c=32), kernel = [3, 3], strides
> = [1, 1], padding = [0, 0], layer size (output channels) = 32, convolution
> mode = Truncate
> {noformat}
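The bound in the quoted exception can be checked by hand. A small sketch, using only the numbers from the message (kernel = [3, 3], padding = [0, 0], input h=149, w=1); `kernel_fits` is a hypothetical helper, not a DL4J function:

```python
def kernel_fits(kernel, in_size, pad):
    """Check 0 < k <= in + 2*pad, the bound quoted in the exception message."""
    return 0 < kernel <= in_size + 2 * pad


# Height: 0 < 3 <= 149 + 2*0 holds.
print(kernel_fits(3, 149, 0))  # True
# Width: 0 < 3 <= 1 + 2*0 does not hold, which is the reported failure.
print(kernel_fits(3, 1, 0))    # False
```

So the height dimension passes and only the width of 1 trips the check for `convolution2d_2`.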