[jira] [Updated] (NUTCH-1076) Solrindex has no documents following bin/nutch solrindex when using protocol-file

2014-07-07 Thread Julien Nioche (JIRA)

 [ https://issues.apache.org/jira/browse/NUTCH-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-1076:
---------------------------------

Fix Version/s: 1.10 (was: 1.9)

> Solrindex has no documents following bin/nutch solrindex when using 
> protocol-file
> ----------------------------------------------------------------------
>
> Key: NUTCH-1076
> URL: https://issues.apache.org/jira/browse/NUTCH-1076
> Project: Nutch
> Issue Type: Bug
> Components: indexer
> Affects Versions: 1.3
> Environment: Ubuntu Linux 10.04 server
> JDK 1.6
> Nutch 1.3
> Solr 3.1.0
> Reporter: Seth Griffin
> Assignee: Markus Jelsma
> Labels: nutch, protocol-file, solrindex
> Fix For: 1.10
>
>
> Note: When using protocol-http I am able to update solr effortlessly.
> To test this I have a single pdf file that I am trying to index in my urls 
> directory.
> I execute:
> bin/nutch crawl urls
> Output:
> solrUrl is not set, indexing will be skipped...
> crawl started in: crawl-20110805151045
> rootUrlDir = urls
> threads = 10
> depth = 5
> solrUrl=null
> Injector: starting at 2011-08-05 15:10:45
> Injector: crawlDb: crawl-20110805151045/crawldb
> Injector: urlDir: urls
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: finished at 2011-08-05 15:10:48, elapsed: 00:00:02
> Generator: starting at 2011-08-05 15:10:48
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls for politeness.
> Generator: segment: crawl-20110805151045/segments/20110805151050
> Generator: finished at 2011-08-05 15:10:51, elapsed: 00:00:03
> Fetcher: Your 'http.agent.name' value should be listed first in 
> 'http.robots.agents' property.
> Fetcher: starting at 2011-08-05 15:10:51
> Fetcher: segment: crawl-20110805151045/segments/20110805151050
> Fetcher: threads: 10
> QueueFeeder finished: total 1 records + hit by time limit :0
> fetching file:///home/nutch/nutch-1.3/runtime/local/indexdir/Altec.pdf
> -finishing thread FetcherThread, activeThreads=9
> -finishing thread FetcherThread, activeThreads=8
> -finishing thread FetcherThread, activeThreads=7
> -finishing thread FetcherThread, activeThreads=6
> -finishing thread FetcherThread, activeThreads=5
> -finishing thread FetcherThread, activeThreads=4
> -finishing thread FetcherThread, activeThreads=3
> -finishing thread FetcherThread, activeThreads=2
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=0
> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=0
> Fetcher: finished at 2011-08-05 15:10:53, elapsed: 00:00:02
> ParseSegment: starting at 2011-08-05 15:10:53
> ParseSegment: segment: crawl-20110805151045/segments/20110805151050
> ParseSegment: finished at 2011-08-05 15:10:56, elapsed: 00:00:03
> CrawlDb update: starting at 2011-08-05 15:10:56
> CrawlDb update: db: crawl-20110805151045/crawldb
> CrawlDb update: segments: [crawl-20110805151045/segments/20110805151050]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: finished at 2011-08-05 15:10:57, elapsed: 00:00:01
> Generator: starting at 2011-08-05 15:10:57
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: 0 records selected for fetching, exiting ...
> Stopping at depth=1 - no more URLs to fetch.
> LinkDb: starting at 2011-08-05 15:10:58
> LinkDb: linkdb: crawl-20110805151045/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: adding segment: 
> file:/home/nutch/nutch-1.3/runtime/local/crawl-20110805151045/segments/20110805151050
> LinkDb: finished at 2011-08-05 15:10:59, elapsed: 00:00:01
> crawl finished: crawl-20110805151045
> Then with a clean solr index (stats output from stats.jsp below):
> searcherName : Searcher@14dd758 main
> caching : true
> numDocs : 0
> maxDoc : 0
> reader : 
> SolrIndexReader{this=1ee148b,r=ReadOnlyDirectoryReader@1ee148b,refCnt=1,segments=0}
> readerDir : 
> org.apache.lucene.store.NIOFSDirectory@/home/solr/apache-solr-3.1.0/example/solr/data/index
>  lockFactory=org.apache.lucene.store.NativeFSLockFactory@987197
> indexVersion : 1312575204101
> openedAt : Fri Aug 05 15:13:24 CDT 2011
> registeredAt : Fri Aug 05 15:13:24 CDT 2011
> warmupTime : 0 
> I then execute:
> bin/nutch solrindex http://localhost:8983/solr/ crawl-20110805151045/crawldb/ 
> crawl-20110805151045/linkdb/ crawl-20110805151045/segments/*
> bin/nutch output:
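
For context, getting Nutch to fetch file: URLs at all (which the log above shows working) normally takes two conf changes. The snippets below are a minimal sketch of the usual Nutch 1.3 setup, not the reporter's actual files:

    <!-- conf/nutch-site.xml: list protocol-file in plugin.includes
         (shown here replacing the default protocol-http entry) -->
    <property>
      <name>plugin.includes</name>
      <value>protocol-file|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>

    # conf/regex-urlfilter.txt: the stock rule "-^(file|ftp|mailto):"
    # silently drops file: URLs, so narrow it and accept file: explicitly
    -^(ftp|mailto):
    +^file://

Since the fetch and parse above completed, the crawl-side filters were evidently already in place; what this issue tracks is why the documents still never reach Solr at the solrindex step.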
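A few read-only checks (tool names as shipped with Nutch 1.3 and Solr 3.1; paths taken from the log above) can narrow down where the document disappears:

    # Is the URL in the crawldb, and with what status (expect db_fetched)?
    bin/nutch readdb crawl-20110805151045/crawldb -stats
    bin/nutch readdb crawl-20110805151045/crawldb -url file:///home/nutch/nutch-1.3/runtime/local/indexdir/Altec.pdf

    # Did the PDF parse into the segment (look for ParseText in the dump)?
    bin/nutch readseg -dump crawl-20110805151045/segments/20110805151050 segdump

    # Does Solr report any documents after solrindex runs?
    curl 'http://localhost:8983/solr/select?q=*:*&rows=0'

If the segment dump contains parse text and the crawldb entry is db_fetched while Solr's numFound stays at 0, the documents are being dropped inside the solrindex job itself.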

[jira] [Updated] (NUTCH-1076) Solrindex has no documents following bin/nutch solrindex when using protocol-file

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ https://issues.apache.org/jira/browse/NUTCH-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1076:
----------------------------------------

Fix Version/s: 1.7
