Dear Wiki user, You have subscribed to a wiki page or wiki category on "Tika Wiki" for change notification.
The "AgeDetectionParser" page has been changed by ChrisMattmann:
https://wiki.apache.org/tika/AgeDetectionParser?action=diff&rev1=2&rev2=3

  = Pre-requisites =
- None
+ None. All of the needed USC IRDS models (for age classification ```classify-bigram.bin```, and ```classify-bigram.bin```) are downloaded automatically and will be available in your ```$TIKASRC/tika-parsers/model``` directory.
+ = Tests to Run Beforehand =
+
+ It's worth checking that age prediction works for you before using it in Tika. To do so:
+
+ == Download and Build AgePredictor ==
+
+ 1. cd $HOME/src && git clone https://github.com/USCDataScience/AgePredictor.git
+ 2. cd !AgePredictor && mvn install
+
+ == Test AgePredictor ==
+
+ 1. ```java -cp age-predictor-assembly/target/age-predictor-assembly-1.1-SNAPSHOT-jar-with-dependencies.jar edu.usc.irds.agepredictor.authorage.AgePredicterLocal I am actually very young now```
+
+ The above should print something like:
+
+ {{{
+ $ java -cp age-predictor-assembly/target/age-predictor-assembly-1.1-SNAPSHOT-jar-with-dependencies.jar edu.usc.irds.agepredictor.authorage.AgePredicterLocal I am actually very young now + Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties + 17/07/06 23:23:24 INFO SparkContext: Running Spark version 2.0.0 + 17/07/06 23:23:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable + 17/07/06 23:23:24 INFO SecurityManager: Changing view acls to: mattmann + 17/07/06 23:23:24 INFO SecurityManager: Changing modify acls to: mattmann + 17/07/06 23:23:24 INFO SecurityManager: Changing view acls groups to: + 17/07/06 23:23:24 INFO SecurityManager: Changing modify acls groups to: + 17/07/06 23:23:24 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(mattmann); groups with view permissions: Set(); users with modify permissions: Set(mattmann); groups with modify permissions: Set() + 17/07/06 23:23:25 INFO Utils: Successfully started service 'sparkDriver' on port 54970. + 17/07/06 23:23:25 INFO SparkEnv: Registering MapOutputTracker + 17/07/06 23:23:25 INFO SparkEnv: Registering BlockManagerMaster + 17/07/06 23:23:25 INFO DiskBlockManager: Created local directory at /private/var/folders/n5/1d_k3z4s2293q8ntx_n8sw54mm5n_8/T/blockmgr-aa033554-acff-4ea1-a5d1-250257f467dc + 17/07/06 23:23:25 INFO MemoryStore: MemoryStore started with capacity 2004.6 MB + 17/07/06 23:23:25 INFO SparkEnv: Registering OutputCommitCoordinator + 17/07/06 23:23:25 INFO Utils: Successfully started service 'SparkUI' on port 4040. + 17/07/06 23:23:25 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://172.20.10.2:4040 + 17/07/06 23:23:25 INFO Executor: Starting executor ID driver on host localhost + 17/07/06 23:23:25 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 54971. 
+ 17/07/06 23:23:25 INFO NettyBlockTransferService: Server created on 172.20.10.2:54971 + 17/07/06 23:23:25 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 172.20.10.2, 54971) + 17/07/06 23:23:25 INFO BlockManagerMasterEndpoint: Registering block manager 172.20.10.2:54971 with 2004.6 MB RAM, BlockManagerId(driver, 172.20.10.2, 54971) + 17/07/06 23:23:25 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 172.20.10.2, 54971) + 17/07/06 23:23:26 WARN SparkContext: Use an existing SparkContext, some configuration may not take effect. + 17/07/06 23:23:26 INFO SharedState: Warehouse path is 'file:/Users/mattmann/git/AgePredictor/spark-warehouse'. + 17/07/06 23:23:39 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 6.1 MB, free 1998.5 MB) + 17/07/06 23:23:39 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 488.5 KB, free 1998.0 MB) + 17/07/06 23:23:39 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 172.20.10.2:54971 (size: 488.5 KB, free: 2004.1 MB) + 17/07/06 23:23:39 INFO SparkContext: Created broadcast 0 from broadcast at CountVectorizer.scala:243 + 17/07/06 23:23:42 INFO CodeGenerator: Code generated in 173.162228 ms + 17/07/06 23:23:42 INFO SparkContext: Starting job: first at AgePredicterLocal.java:114 + 17/07/06 23:23:42 INFO DAGScheduler: Got job 0 (first at AgePredicterLocal.java:114) with 1 output partitions + 17/07/06 23:23:42 INFO DAGScheduler: Final stage: ResultStage 0 (first at AgePredicterLocal.java:114) + 17/07/06 23:23:42 INFO DAGScheduler: Parents of final stage: List() + 17/07/06 23:23:42 INFO DAGScheduler: Missing parents: List() + 17/07/06 23:23:42 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[3] at javaRDD at AgePredicterLocal.java:112), which has no missing parents + 17/07/06 23:23:42 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 10.9 KB, free 1998.0 MB) + 17/07/06 23:23:42 
INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 5.4 KB, free 1998.0 MB) + 17/07/06 23:23:42 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 172.20.10.2:54971 (size: 5.4 KB, free: 2004.1 MB) + 17/07/06 23:23:42 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1012 + 17/07/06 23:23:42 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[3] at javaRDD at AgePredicterLocal.java:112) + 17/07/06 23:23:42 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks + 17/07/06 23:23:42 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0, PROCESS_LOCAL, 6671 bytes) + 17/07/06 23:23:42 INFO Executor: Running task 0.0 in stage 0.0 (TID 0) + 17/07/06 23:23:42 INFO CodeGenerator: Code generated in 21.816953 ms + 17/07/06 23:23:42 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 3381 bytes result sent to driver + 17/07/06 23:23:42 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 123 ms on localhost (1/1) + 17/07/06 23:23:42 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool + 17/07/06 23:23:42 INFO DAGScheduler: ResultStage 0 (first at AgePredicterLocal.java:114) finished in 0.139 s + 17/07/06 23:23:42 INFO DAGScheduler: Job 0 finished: first at AgePredicterLocal.java:114, took 0.221228 s + + =================== + + Text received- 'I am actually very young now ' + Predicted Age - 34.983567 + + + =================== + + 17/07/06 23:23:43 INFO SparkContext: Invoking stop() from shutdown hook + 17/07/06 23:23:43 INFO SparkUI: Stopped Spark web UI at http://172.20.10.2:4040 + 17/07/06 23:23:43 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped! 
+ 17/07/06 23:23:43 INFO MemoryStore: MemoryStore cleared + 17/07/06 23:23:43 INFO BlockManager: BlockManager stopped + 17/07/06 23:23:43 INFO BlockManagerMaster: BlockManagerMaster stopped + 17/07/06 23:23:43 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped! + 17/07/06 23:23:43 INFO SparkContext: Successfully stopped SparkContext + 17/07/06 23:23:43 INFO ShutdownHookManager: Shutdown hook called + 17/07/06 23:23:43 INFO ShutdownHookManager: Deleting directory /private/var/folders/n5/1d_k3z4s2293q8ntx_n8sw54mm5n_8/T/spark-31e98436-00e3-4020-a4c5-784ec75b16de + $
+ }}}
+
+ If you see a "Predicted Age" line like the one above, you are in good shape. If not, please report a bug [[https://github.com/USCDataScience/AgePredictor/issues/|here]].
+
+ = Running AgeRecogniser (Tika's Parser) =
+
+ To run AgeRecogniser, download and install Tika 1.16 or later, then run the following. Before you start, make sure your $TIKASRC/tika-parsers/model directory is populated with the models, as described above.
+
+ 1. ```cd $HOME/src/ && git clone https://github.com/apache/tika.git```
+ 2. ```cd tika/tika-parsers && echo "I am a test file" > test.txt```
+ 3. ```java -cp ../tika-app/target/tika-app-1.16-SNAPSHOT.jar:./model org.apache.tika.cli.TikaCLI --config=src/test/resources/org/apache/tika/parser/recognition/tika-config-age.xml -m test.txt```
+
+ You should then see:
+
+ {{{
+ $ java -cp ../tika-app/target/tika-app-1.16-SNAPSHOT.jar:./model org.apache.tika.cli.TikaCLI --config=src/test/resources/org/apache/tika/parser/recognition/tika-config-age.xml -m test.txt + Jul 06, 2017 11:31:49 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem + WARNING: com.levigo.jbig2.JBIG2ImageReader not on class path. The ImageParser will skip jbig2 images + Jul 06, 2017 11:31:49 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem + WARNING: JBIG2ImageReader not loaded. 
jbig2 files will be ignored + See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io + for optional dependencies. + TIFFImageWriter not loaded. tiff files will not be processed + See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io + for optional dependencies. + J2KImageReader not loaded. JPEG2000 files will not be processed. + See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io + for optional dependencies. + + Jul 06, 2017 11:31:49 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem + WARNING: Tesseract OCR is installed and will be automatically applied to image files. + This may dramatically slow down content extraction (TIKA-2359). + As of Tika 1.15 (and prior versions), Tesseract is automatically called. + In future versions of Tika, users may need to turn the TesseractOCRParser on via TikaConfig. + Jul 06, 2017 11:31:49 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem + WARNING: org.xerial's sqlite-jdbc is not loaded. + Please provide the jar on your classpath to parse sqlite files. + See tika-parsers/pom.xml for the correct version. + Jul 06, 2017 11:31:49 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem + WARNING: com.levigo.jbig2.JBIG2ImageReader not on class path. The ImageParser will skip jbig2 images + Jul 06, 2017 11:31:49 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem + WARNING: JBIG2ImageReader not loaded. jbig2 files will be ignored + See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io + for optional dependencies. + TIFFImageWriter not loaded. tiff files will not be processed + See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io + for optional dependencies. + J2KImageReader not loaded. JPEG2000 files will not be processed. + See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io + for optional dependencies. 
+ + Jul 06, 2017 11:31:49 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem + WARNING: Tesseract OCR is installed and will be automatically applied to image files. + This may dramatically slow down content extraction (TIKA-2359). + As of Tika 1.15 (and prior versions), Tesseract is automatically called. + In future versions of Tika, users may need to turn the TesseractOCRParser on via TikaConfig. + Jul 06, 2017 11:31:49 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem + WARNING: org.xerial's sqlite-jdbc is not loaded. + Please provide the jar on your classpath to parse sqlite files. + See tika-parsers/pom.xml for the correct version. + INFO Running Spark version 2.0.0 + WARN Unable to load native-hadoop library for your platform... using builtin-java classes where applicable + INFO Changing view acls to: mattmann + INFO Changing modify acls to: mattmann + INFO Changing view acls groups to: + INFO Changing modify acls groups to: + INFO SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(mattmann); groups with view permissions: Set(); users with modify permissions: Set(mattmann); groups with modify permissions: Set() + INFO Successfully started service 'sparkDriver' on port 55146. 
+ INFO Registering MapOutputTracker + INFO Registering BlockManagerMaster + INFO Created local directory at /private/var/folders/n5/1d_k3z4s2293q8ntx_n8sw54mm5n_8/T/blockmgr-8f68cc3d-a90e-4b6e-a634-102200c16a64 + INFO MemoryStore started with capacity 2004.6 MB + INFO Registering OutputCommitCoordinator + INFO Logging initialized @1214ms + INFO jetty-9.2.z-SNAPSHOT + INFO Started o.s.j.s.ServletContextHandler@f73dcd6{/jobs,null,AVAILABLE} + INFO Started o.s.j.s.ServletContextHandler@5c87bfe2{/jobs/json,null,AVAILABLE} + INFO Started o.s.j.s.ServletContextHandler@2fea7088{/jobs/job,null,AVAILABLE} + INFO Started o.s.j.s.ServletContextHandler@40499e4f{/jobs/job/json,null,AVAILABLE} + INFO Started o.s.j.s.ServletContextHandler@51cd7ffc{/stages,null,AVAILABLE} + INFO Started o.s.j.s.ServletContextHandler@30d4b288{/stages/json,null,AVAILABLE} + INFO Started o.s.j.s.ServletContextHandler@4cc6fa2a{/stages/stage,null,AVAILABLE} + INFO Started o.s.j.s.ServletContextHandler@40f1be1b{/stages/stage/json,null,AVAILABLE} + INFO Started o.s.j.s.ServletContextHandler@7a791b66{/stages/pool,null,AVAILABLE} + INFO Started o.s.j.s.ServletContextHandler@6f2cb653{/stages/pool/json,null,AVAILABLE} + INFO Started o.s.j.s.ServletContextHandler@14c01636{/storage,null,AVAILABLE} + INFO Started o.s.j.s.ServletContextHandler@590c73d3{/storage/json,null,AVAILABLE} + INFO Started o.s.j.s.ServletContextHandler@6b9ce1bf{/storage/rdd,null,AVAILABLE} + INFO Started o.s.j.s.ServletContextHandler@61884cb1{/storage/rdd/json,null,AVAILABLE} + INFO Started o.s.j.s.ServletContextHandler@75ed9710{/environment,null,AVAILABLE} + INFO Started o.s.j.s.ServletContextHandler@4fc5e095{/environment/json,null,AVAILABLE} + INFO Started o.s.j.s.ServletContextHandler@435871cb{/executors,null,AVAILABLE} + INFO Started o.s.j.s.ServletContextHandler@609640d5{/executors/json,null,AVAILABLE} + INFO Started o.s.j.s.ServletContextHandler@79da1ec0{/executors/threadDump,null,AVAILABLE} + INFO Started 
o.s.j.s.ServletContextHandler@19fb8826{/executors/threadDump/json,null,AVAILABLE} + INFO Started o.s.j.s.ServletContextHandler@192d74fb{/static,null,AVAILABLE} + INFO Started o.s.j.s.ServletContextHandler@4bef0fe3{/,null,AVAILABLE} + INFO Started o.s.j.s.ServletContextHandler@62ea3440{/api,null,AVAILABLE} + INFO Started o.s.j.s.ServletContextHandler@27953a83{/stages/stage/kill,null,AVAILABLE} + INFO Started ServerConnector@25748410{HTTP/1.1}{0.0.0.0:4040} + INFO Started @1291ms + INFO Successfully started service 'SparkUI' on port 4040. + INFO Bound SparkUI to 0.0.0.0, and started at http://172.20.10.2:4040 + INFO Starting executor ID driver on host localhost + INFO Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 55147. + INFO Server created on 172.20.10.2:55147 + INFO Registering BlockManager BlockManagerId(driver, 172.20.10.2, 55147) + INFO Registering block manager 172.20.10.2:55147 with 2004.6 MB RAM, BlockManagerId(driver, 172.20.10.2, 55147) + INFO Registered BlockManager BlockManagerId(driver, 172.20.10.2, 55147) + INFO Started o.s.j.s.ServletContextHandler@5305c37d{/metrics/json,null,AVAILABLE} + WARN Use an existing SparkContext, some configuration may not take effect. + INFO Started o.s.j.s.ServletContextHandler@3c1e3314{/SQL,null,AVAILABLE} + INFO Started o.s.j.s.ServletContextHandler@78e16155{/SQL/json,null,AVAILABLE} + INFO Started o.s.j.s.ServletContextHandler@50b0bc4c{/SQL/execution,null,AVAILABLE} + INFO Started o.s.j.s.ServletContextHandler@13c612bd{/SQL/execution/json,null,AVAILABLE} + INFO Started o.s.j.s.ServletContextHandler@28fa700e{/static/sql,null,AVAILABLE} + INFO Warehouse path is 'file:/Users/mattmann/tmp/tika1.15/tika-parsers/spark-warehouse'. 
+ INFO Block broadcast_0 stored as values in memory (estimated size 6.1 MB, free 1998.5 MB) + INFO Block broadcast_0_piece0 stored as bytes in memory (estimated size 488.5 KB, free 1998.0 MB) + INFO Added broadcast_0_piece0 in memory on 172.20.10.2:55147 (size: 488.5 KB, free: 2004.1 MB) + INFO Created broadcast 0 from broadcast at CountVectorizer.scala:243 + INFO Code generated in 1362.73497 ms + INFO Starting job: first at AgePredicterLocal.java:114 + INFO Got job 0 (first at AgePredicterLocal.java:114) with 1 output partitions + INFO Final stage: ResultStage 0 (first at AgePredicterLocal.java:114) + INFO Parents of final stage: List() + INFO Missing parents: List() + INFO Submitting ResultStage 0 (MapPartitionsRDD[3] at javaRDD at AgePredicterLocal.java:112), which has no missing parents + INFO Block broadcast_1 stored as values in memory (estimated size 10.5 KB, free 1998.0 MB) + INFO Block broadcast_1_piece0 stored as bytes in memory (estimated size 5.3 KB, free 1998.0 MB) + INFO Added broadcast_1_piece0 in memory on 172.20.10.2:55147 (size: 5.3 KB, free: 2004.1 MB) + INFO Created broadcast 1 from broadcast at DAGScheduler.scala:1012 + INFO Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[3] at javaRDD at AgePredicterLocal.java:112) + INFO Adding task set 0.0 with 1 tasks + INFO Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0, PROCESS_LOCAL, 6477 bytes) + INFO Running task 0.0 in stage 0.0 (TID 0) + INFO Code generated in 14.190189 ms + INFO Finished task 0.0 in stage 0.0 (TID 0). 
3228 bytes result sent to driver + INFO Finished task 0.0 in stage 0.0 (TID 0) in 74 ms on localhost (1/1) + INFO Removed TaskSet 0.0, whose tasks have all completed, from pool + INFO ResultStage 0 (first at AgePredicterLocal.java:114) finished in 0.083 s + INFO Job 0 finished: first at AgePredicterLocal.java:114, took 0.128993 s + Content-Length: 17 + Content-Type: text/plain + Estimated-Author-Age: 32.29913797083779 + X-Parsed-By: org.apache.tika.parser.CompositeParser + X-Parsed-By: org.apache.tika.parser.recognition.AgeRecogniser + resourceName: test.txt + INFO Invoking stop() from shutdown hook + INFO Stopped ServerConnector@25748410{HTTP/1.1}{0.0.0.0:4040} + INFO Stopped o.s.j.s.ServletContextHandler@27953a83{/stages/stage/kill,null,UNAVAILABLE} + INFO Stopped o.s.j.s.ServletContextHandler@62ea3440{/api,null,UNAVAILABLE} + INFO Stopped o.s.j.s.ServletContextHandler@4bef0fe3{/,null,UNAVAILABLE} + INFO Stopped o.s.j.s.ServletContextHandler@192d74fb{/static,null,UNAVAILABLE} + INFO Stopped o.s.j.s.ServletContextHandler@19fb8826{/executors/threadDump/json,null,UNAVAILABLE} + INFO Stopped o.s.j.s.ServletContextHandler@79da1ec0{/executors/threadDump,null,UNAVAILABLE} + INFO Stopped o.s.j.s.ServletContextHandler@609640d5{/executors/json,null,UNAVAILABLE} + INFO Stopped o.s.j.s.ServletContextHandler@435871cb{/executors,null,UNAVAILABLE} + INFO Stopped o.s.j.s.ServletContextHandler@4fc5e095{/environment/json,null,UNAVAILABLE} + INFO Stopped o.s.j.s.ServletContextHandler@75ed9710{/environment,null,UNAVAILABLE} + INFO Stopped o.s.j.s.ServletContextHandler@61884cb1{/storage/rdd/json,null,UNAVAILABLE} + INFO Stopped o.s.j.s.ServletContextHandler@6b9ce1bf{/storage/rdd,null,UNAVAILABLE} + INFO Stopped o.s.j.s.ServletContextHandler@590c73d3{/storage/json,null,UNAVAILABLE} + INFO Stopped o.s.j.s.ServletContextHandler@14c01636{/storage,null,UNAVAILABLE} + INFO Stopped o.s.j.s.ServletContextHandler@6f2cb653{/stages/pool/json,null,UNAVAILABLE} + INFO Stopped 
o.s.j.s.ServletContextHandler@7a791b66{/stages/pool,null,UNAVAILABLE} + INFO Stopped o.s.j.s.ServletContextHandler@40f1be1b{/stages/stage/json,null,UNAVAILABLE} + INFO Stopped o.s.j.s.ServletContextHandler@4cc6fa2a{/stages/stage,null,UNAVAILABLE} + INFO Stopped o.s.j.s.ServletContextHandler@30d4b288{/stages/json,null,UNAVAILABLE} + INFO Stopped o.s.j.s.ServletContextHandler@51cd7ffc{/stages,null,UNAVAILABLE} + INFO Stopped o.s.j.s.ServletContextHandler@40499e4f{/jobs/job/json,null,UNAVAILABLE} + INFO Stopped o.s.j.s.ServletContextHandler@2fea7088{/jobs/job,null,UNAVAILABLE} + INFO Stopped o.s.j.s.ServletContextHandler@5c87bfe2{/jobs/json,null,UNAVAILABLE} + INFO Stopped o.s.j.s.ServletContextHandler@f73dcd6{/jobs,null,UNAVAILABLE} + INFO Stopped Spark web UI at http://172.20.10.2:4040 + INFO MapOutputTrackerMasterEndpoint stopped! + INFO MemoryStore cleared + INFO BlockManager stopped + INFO BlockManagerMaster stopped + INFO OutputCommitCoordinator stopped! + INFO Successfully stopped SparkContext + INFO Shutdown hook called + INFO Deleting directory /private/var/folders/n5/1d_k3z4s2293q8ntx_n8sw54mm5n_8/T/spark-68607012-1ad6-4919-ac2a-36c135f20ae8 + $ + }}} +
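
Both commands above bury the one line of interest in a large amount of Spark and Tika logging. As a convenience, the value can be pulled out with a small filter. The following is only a sketch: the function names `estimated_author_age` and `predicted_age` are hypothetical helpers, not part of Tika or AgePredictor, and they assume the output keeps the `Estimated-Author-Age:` and `Predicted Age -` line formats shown in the logs above.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Tika or AgePredictor). Each one reads
# command output on stdin and prints only the age value, dropping all
# other lines such as the Spark INFO/WARN log entries.

# Reads TikaCLI -m output and prints the Estimated-Author-Age value.
estimated_author_age() {
    sed -n 's/^[[:space:]]*Estimated-Author-Age:[[:space:]]*//p'
}

# Reads AgePredicterLocal output and prints the "Predicted Age" value.
predicted_age() {
    sed -n 's/^[[:space:]]*Predicted Age[[:space:]]*-[[:space:]]*//p'
}
```

After defining these functions in your shell, one way to use them is to append `2>&1 | estimated_author_age` to the TikaCLI command, or `2>&1 | predicted_age` to the AgePredicterLocal command. Merging stderr into stdout means the filter works regardless of which stream the log lines are written to.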
