[ https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154034#comment-16154034 ]
ASF GitHub Bot commented on NUTCH-1129:
---------------------------------------
simoncpu commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#issuecomment-327245880
@lewismc Here's one of the URLs that I've tried:
http://mcdonalds.jobs/salt-lake-city-ut/general-manager/2947B6E7B04147FFBEE1445E66D7EA67/job/
BTW, the previous patch was able to parse the Microdata without problems. :)
EDIT: here's the full output:
```
Thread FetcherThread has no more work available
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Fetcher: throughput threshold: -1
Thread FetcherThread has no more work available
Fetcher: throughput threshold retries: 5
-finishing thread FetcherThread, activeThreads=1
fetcher.maxNum.threads can't be < than 50 : using 50 instead
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=1
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=0
-activeThreads=0
Fetcher: finished at 2017-09-05 17:25:43, elapsed: 00:00:08
Parsing : 20170905172529
/home/simoncpu/nutch/runtime/local/bin/nutch parse -D mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true -D mapreduce.task.skip.start.attempts=2 -D mapreduce.map.skip.maxrecords=1 crawl-dir/segments/20170905172529
ParseSegment: starting at 2017-09-05 17:25:45
ParseSegment: segment: crawl-dir/segments/20170905172529
Error parsing: http://mcdonalds.jobs/salt-lake-city-ut/general-manager/2947B6E7B04147FFBEE1445E66D7EA67/job/: failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content
Parsed (225ms):http://mcdonalds.jobs/salt-lake-city-ut/general-manager/2947B6E7B04147FFBEE1445E66D7EA67/job/
ParseSegment: finished at 2017-09-05 17:25:51, elapsed: 00:00:06
CrawlDB update
/home/simoncpu/nutch/runtime/local/bin/nutch updatedb -D mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true crawl-dir/crawldb crawl-dir/segments/20170905172529
CrawlDb update: starting at 2017-09-05 17:25:53
CrawlDb update: db: crawl-dir/crawldb
CrawlDb update: segments: [crawl-dir/segments/20170905172529]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2017-09-05 17:25:59, elapsed: 00:00:05
Link inversion
/home/simoncpu/nutch/runtime/local/bin/nutch invertlinks crawl-dir/linkdb crawl-dir/segments/20170905172529
LinkDb: starting at 2017-09-05 17:26:01
LinkDb: linkdb: crawl-dir/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: internal links will be ignored.
LinkDb: adding segment: crawl-dir/segments/20170905172529
LinkDb: finished at 2017-09-05 17:26:06, elapsed: 00:00:04
Dedup on crawldb
/home/simoncpu/nutch/runtime/local/bin/nutch dedup crawl-dir/crawldb
DeduplicationJob: starting at 2017-09-05 17:26:07
Deduplication: 0 documents marked as duplicates
Deduplication: Updating status of duplicate urls into crawl db.
Deduplication finished at 2017-09-05 17:26:15, elapsed: 00:00:07
Indexing 20170905172529 to index
/home/simoncpu/nutch/runtime/local/bin/nutch index crawl-dir/crawldb -linkdb crawl-dir/linkdb crawl-dir/segments/20170905172529
Segment dir is complete: crawl-dir/segments/20170905172529.
Indexer: starting at 2017-09-05 17:26:17
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
ElasticRestIndexWriter
elastic.rest.host : hostname
elastic.rest.port : port
elastic.rest.index : elastic index command
elastic.rest.max.bulk.docs : elastic bulk index doc counts. (default 250)
elastic.rest.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)
Indexer: number of documents indexed, deleted, or skipped:
Indexer: finished at 2017-09-05 17:26:23, elapsed: 00:00:05
Cleaning up index if possible
/home/simoncpu/nutch/runtime/local/bin/nutch clean crawl-dir/crawldb
Wed Sep 6 01:26:28 DST 2017 : Finished loop with 1 iterations
```
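To help isolate whether the failure is in the plugin glue or in Any23 itself, a standalone extraction against the same URL could be tried. Below is a minimal sketch using the Any23 core API; the `html-microdata` extractor name and the user-agent string here are assumptions for illustration, not values taken from the PR:

```java
import java.io.ByteArrayOutputStream;

import org.apache.any23.Any23;
import org.apache.any23.source.DocumentSource;
import org.apache.any23.source.HTTPDocumentSource;
import org.apache.any23.writer.NTriplesWriter;
import org.apache.any23.writer.TripleHandler;

public class MicrodataCheck {
    public static void main(String[] args) throws Exception {
        // Restrict extraction to the HTML microdata extractor only.
        Any23 runner = new Any23("html-microdata");
        runner.setHTTPUserAgent("nutch-any23-test/1.0"); // illustrative UA

        DocumentSource source = new HTTPDocumentSource(
                runner.getHTTPClient(),
                "http://mcdonalds.jobs/salt-lake-city-ut/general-manager/2947B6E7B04147FFBEE1445E66D7EA67/job/");

        // Collect extracted triples as N-Triples in memory.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        TripleHandler handler = new NTriplesWriter(out);
        try {
            runner.extract(source, handler); // throws ExtractionException on failure
        } finally {
            handler.close();
        }
        System.out.println(out.toString("UTF-8"));
    }
}
```

If this prints microdata triples, Any23 itself can handle the page and the problem is more likely in how the plugin invokes it; if it throws an `ExtractionException`, the regression would point at the Any23 version the new patch pulls in.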
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
> Any23 Nutch plugin
> ------------------
>
> Key: NUTCH-1129
> URL: https://issues.apache.org/jira/browse/NUTCH-1129
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Minor
> Fix For: 2.5
>
> Attachments: NUTCH-1129.patch
>
>
> This plugin should build on the Any23 library to provide us with a plugin
> which extracts RDF data from HTTP and file resources. Although, as of writing,
> Any23 is not part of the ASF, the project is working towards integration into
> the Apache Incubator. Once the project proves its value, this would be an
> excellent addition to the Nutch 1.X codebase.
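For context, a plugin like the one described above would typically hook into Nutch 1.x's parse pipeline as an HtmlParseFilter. The following is a hypothetical sketch only; the class name, the `structured.data` metadata key, and the `extractTriples` helper are illustrative and not the actual NUTCH-1129 implementation:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;

/** Hypothetical sketch: hand fetched HTML to Any23 and stash triples in parse metadata. */
public class Any23SketchFilter implements HtmlParseFilter {

  private Configuration conf;

  @Override
  public ParseResult filter(Content content, ParseResult parseResult,
                            HTMLMetaTags metaTags, DocumentFragment doc) {
    // A real implementation would feed content.getContent() to Any23
    // (e.g. via a StringDocumentSource) and collect triples with a TripleHandler.
    String triples = extractTriples(content); // hypothetical helper
    Parse parse = parseResult.get(content.getUrl());
    if (parse != null) {
      Metadata meta = parse.getData().getParseMeta();
      meta.add("structured.data", triples); // hypothetical metadata key
    }
    return parseResult;
  }

  private String extractTriples(Content content) {
    return ""; // placeholder; see the Any23 snippet earlier in this thread
  }

  @Override
  public void setConf(Configuration conf) { this.conf = conf; }

  @Override
  public Configuration getConf() { return conf; }
}
```

The actual patch under review in PR #205 may structure this differently; the sketch only illustrates where Any23 would sit in the parse flow.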
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)