[ https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154034#comment-16154034 ]

ASF GitHub Bot commented on NUTCH-1129:
---------------------------------------

simoncpu commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#issuecomment-327245880
 
 
   @lewismc Here's one of the URLs that I've tried:
   
   
http://mcdonalds.jobs/salt-lake-city-ut/general-manager/2947B6E7B04147FFBEE1445E66D7EA67/job/
   
   BTW, the previous patch was able to parse the Microdata without problems. :)
   
   EDIT, here's the full output:
   ```
   Thread FetcherThread has no more work available
   Using queue mode : byHost
   -finishing thread FetcherThread, activeThreads=1
   Fetcher: throughput threshold: -1
   Thread FetcherThread has no more work available
   Fetcher: throughput threshold retries: 5
   -finishing thread FetcherThread, activeThreads=1
   fetcher.maxNum.threads can't be < than 50 : using 50 instead
   -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=1
   -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=1
   Thread FetcherThread has no more work available
   -finishing thread FetcherThread, activeThreads=0
   -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=0
   -activeThreads=0
   Fetcher: finished at 2017-09-05 17:25:43, elapsed: 00:00:08
   Parsing : 20170905172529
   /home/simoncpu/nutch/runtime/local/bin/nutch parse -D mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true -D mapreduce.task.skip.start.attempts=2 -D mapreduce.map.skip.maxrecords=1 crawl-dir/segments/20170905172529
   ParseSegment: starting at 2017-09-05 17:25:45
   ParseSegment: segment: crawl-dir/segments/20170905172529
   Error parsing: http://mcdonalds.jobs/salt-lake-city-ut/general-manager/2947B6E7B04147FFBEE1445E66D7EA67/job/: failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content
   Parsed (225ms):http://mcdonalds.jobs/salt-lake-city-ut/general-manager/2947B6E7B04147FFBEE1445E66D7EA67/job/
   ParseSegment: finished at 2017-09-05 17:25:51, elapsed: 00:00:06
   CrawlDB update
   /home/simoncpu/nutch/runtime/local/bin/nutch updatedb -D mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true crawl-dir/crawldb crawl-dir/segments/20170905172529
   CrawlDb update: starting at 2017-09-05 17:25:53
   CrawlDb update: db: crawl-dir/crawldb
   CrawlDb update: segments: [crawl-dir/segments/20170905172529]
   CrawlDb update: additions allowed: true
   CrawlDb update: URL normalizing: false
   CrawlDb update: URL filtering: false
   CrawlDb update: 404 purging: false
   CrawlDb update: Merging segment data into db.
   CrawlDb update: finished at 2017-09-05 17:25:59, elapsed: 00:00:05
   Link inversion
   /home/simoncpu/nutch/runtime/local/bin/nutch invertlinks crawl-dir/linkdb crawl-dir/segments/20170905172529
   LinkDb: starting at 2017-09-05 17:26:01
   LinkDb: linkdb: crawl-dir/linkdb
   LinkDb: URL normalize: true
   LinkDb: URL filter: true
   LinkDb: internal links will be ignored.
   LinkDb: adding segment: crawl-dir/segments/20170905172529
   LinkDb: finished at 2017-09-05 17:26:06, elapsed: 00:00:04
   Dedup on crawldb
   /home/simoncpu/nutch/runtime/local/bin/nutch dedup crawl-dir/crawldb
   DeduplicationJob: starting at 2017-09-05 17:26:07
   Deduplication: 0 documents marked as duplicates
   Deduplication: Updating status of duplicate urls into crawl db.
   Deduplication finished at 2017-09-05 17:26:15, elapsed: 00:00:07
   Indexing 20170905172529 to index
   /home/simoncpu/nutch/runtime/local/bin/nutch index crawl-dir/crawldb -linkdb crawl-dir/linkdb crawl-dir/segments/20170905172529
   Segment dir is complete: crawl-dir/segments/20170905172529.
   Indexer: starting at 2017-09-05 17:26:17
   Indexer: deleting gone documents: false
   Indexer: URL filtering: false
   Indexer: URL normalizing: false
   Active IndexWriters :
   ElasticRestIndexWriter
           elastic.rest.host : hostname
           elastic.rest.port : port
           elastic.rest.index : elastic index command
           elastic.rest.max.bulk.docs : elastic bulk index doc counts. (default 250)
           elastic.rest.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)
   
   
   Indexer: number of documents indexed, deleted, or skipped:
   Indexer: finished at 2017-09-05 17:26:23, elapsed: 00:00:05
   Cleaning up index if possible
   /home/simoncpu/nutch/runtime/local/bin/nutch clean crawl-dir/crawldb
   Wed Sep 6 01:26:28 DST 2017 : Finished loop with 1 iterations
   ```
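For isolating a failure like the `ParseException` above, the parse step can be re-run against a single URL with Nutch's `parsechecker` tool instead of a full crawl cycle. This is a sketch only, assuming the same `runtime/local` layout shown in the log; the path and URL are taken from the output above.

```shell
# Fetch and parse just the failing URL, dumping the extracted text,
# so the parse error can be reproduced without a full crawl iteration.
# Assumes the local Nutch runtime from the log above.
/home/simoncpu/nutch/runtime/local/bin/nutch parsechecker \
  -dumpText \
  "http://mcdonalds.jobs/salt-lake-city-ut/general-manager/2947B6E7B04147FFBEE1445E66D7EA67/job/"
```

Running the parser in isolation also makes it easier to compare behavior between the previous patch and the current one on the same page.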
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> Any23 Nutch plugin
> ------------------
>
>                 Key: NUTCH-1129
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1129
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Minor
>             Fix For: 2.5
>
>         Attachments: NUTCH-1129.patch
>
>
> This plugin should build on the Any23 library to provide us with a plugin 
> which extracts RDF data from HTTP and file resources. Although as of this 
> writing Any23 is not part of the ASF, the project is working towards 
> integration into the Apache Incubator. Once the project proves its value, 
> this would be an excellent addition to the Nutch 1.X codebase.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
