[jira] [Closed] (NUTCH-2114) kkk

2015-09-20 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche closed NUTCH-2114.

Resolution: Invalid

> kkk
> ---
>
> Key: NUTCH-2114
> URL: https://issues.apache.org/jira/browse/NUTCH-2114
> Project: Nutch
>  Issue Type: Bug
>  Components: administration gui, commoncrawl, injector
>Reporter: Badreddine Ahmed
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (NUTCH-1946) Upgrade to Gora 0.6.1

2015-09-20 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-1946.
-
Resolution: Fixed

Committed @revision 1704128 in 2.X HEAD

> Upgrade to Gora 0.6.1
> -
>
> Key: NUTCH-1946
> URL: https://issues.apache.org/jira/browse/NUTCH-1946
> Project: Nutch
>  Issue Type: Improvement
>  Components: storage
>Affects Versions: 2.3.1
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Critical
> Fix For: 2.3.1
>
> Attachments: NUTCH-1946.patch, NUTCH-1946_Gora_fixes.patch, 
> NUTCH-1946v2.patch, NUTCH-1946v3.patch, NUTCH-1946v4.patch
>
>
> Apache Gora 0.6.1 was released recently.
> We should upgrade before pushing Nutch 2.3.1 as it will come in very handy 
> for the new Docker containers.





[jira] [Resolved] (NUTCH-1286) Refactoring/reimplementing crawling API (NutchApp)

2015-09-20 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-1286.
-
Resolution: Won't Fix

> Refactoring/reimplementing crawling API (NutchApp)
> --
>
> Key: NUTCH-1286
> URL: https://issues.apache.org/jira/browse/NUTCH-1286
> Project: Nutch
>  Issue Type: Improvement
>  Components: administration gui, REST_api, web gui
>Reporter: Ferdy Galema
>  Labels: gsoc2014
> Fix For: 2.3.1
>
>
> This issue is to track changes we (Mathijs and I) have planned for the API 
> and webapp in Nutchgora. We have a pretty good idea of how we want to be 
> using the crawl API. It may involve some major refactoring or perhaps a side 
> implementation next to the current NutchApp functionality. It depends on how 
> much we can reuse the existing components. The bottom line is that there will 
> be a strictly defined Java API that provides everything from 
> crawling/indexing to job control (listing jobs, tracking progress and 
> aborting jobs being part of it). There will be no server or service for 
> tracking crawling states; all will be persisted one way or the other and 
> queryable from the API. The REST server shall be a very thin layer on top of 
> the Java implementation. A rich web interface will be a very easy layer too, 
> once we have a cleanly (but extensively) defined API. But we will start by 
> making the API usable from a simple command-line interface.
> More details will be provided later on. Feel free to comment if you have 
> suggestions/questions.





[jira] [Resolved] (NUTCH-2101) Upgrade Nutch 2.X to Hadoop 2.5.1

2015-09-20 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-2101.
-
Resolution: Fixed

resolved in NUTCH-1946

> Upgrade Nutch 2.X to Hadoop 2.5.1
> -
>
> Key: NUTCH-2101
> URL: https://issues.apache.org/jira/browse/NUTCH-2101
> Project: Nutch
>  Issue Type: Bug
>  Components: build
>Affects Versions: 2.3.1
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 2.3.1
>
>
> As we did over on NUTCH-2049, we should upgrade Nutch 2.3.1 to work with 
> Hadoop 2.4.0. This is the natural move to fit in nicely with Gora 0.6.1.





[jira] [Commented] (NUTCH-2050) Upgrade HBase and Hadoop versioning on 2.X HBase Docker

2015-09-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14899933#comment-14899933
 ] 

Hudson commented on NUTCH-2050:
---

FAILURE: Integrated in Nutch-nutchgora #1537 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1537/])
NUTCH-2050 Upgrade HBase and Hadoop versioning on 2.X HBase Docker (lewismc: 
http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1704129)
* /nutch/branches/2.x/docker/hbase/Dockerfile
* /nutch/branches/2.x/docker/hbase/README.md


> Upgrade HBase and Hadoop versioning on 2.X HBase Docker 
> 
>
> Key: NUTCH-2050
> URL: https://issues.apache.org/jira/browse/NUTCH-2050
> Project: Nutch
>  Issue Type: Improvement
>  Components: docker
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 2.3.1
>
> Attachments: NUTCH-2050.patch
>
>
> We are working on old versioning.
> Let's sort this out.
> 2.X works perfectly with Hadoop 2.X.





[jira] [Commented] (NUTCH-1946) Upgrade to Gora 0.6.1

2015-09-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14899932#comment-14899932
 ] 

Hudson commented on NUTCH-1946:
---

FAILURE: Integrated in Nutch-nutchgora #1537 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1537/])
NUTCH-1946 Upgrade to Gora 0.6.1 (lewismc: 
http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1704128)
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/conf/nutch-default.xml
* /nutch/branches/2.x/ivy/ivy.xml
* /nutch/branches/2.x/src/java/org/apache/nutch/crawl/DbUpdateMapper.java
* /nutch/branches/2.x/src/java/org/apache/nutch/crawl/DbUpdaterJob.java
* /nutch/branches/2.x/src/java/org/apache/nutch/crawl/GeneratorJob.java
* /nutch/branches/2.x/src/java/org/apache/nutch/crawl/GeneratorMapper.java
* /nutch/branches/2.x/src/java/org/apache/nutch/crawl/GeneratorReducer.java
* /nutch/branches/2.x/src/java/org/apache/nutch/crawl/InjectorJob.java
* /nutch/branches/2.x/src/java/org/apache/nutch/crawl/WebTableReader.java
* /nutch/branches/2.x/src/java/org/apache/nutch/fetcher/FetcherJob.java
* /nutch/branches/2.x/src/java/org/apache/nutch/host/HostDbUpdateJob.java
* /nutch/branches/2.x/src/java/org/apache/nutch/host/HostInjectorJob.java
* /nutch/branches/2.x/src/java/org/apache/nutch/indexer/CleaningJob.java
* /nutch/branches/2.x/src/java/org/apache/nutch/indexer/IndexingJob.java
* /nutch/branches/2.x/src/java/org/apache/nutch/indexer/solr/SolrDeleteDuplicates.java
* /nutch/branches/2.x/src/java/org/apache/nutch/parse/OutlinkExtractor.java
* /nutch/branches/2.x/src/java/org/apache/nutch/parse/ParseStatusCodes.java
* /nutch/branches/2.x/src/java/org/apache/nutch/parse/ParseStatusUtils.java
* /nutch/branches/2.x/src/java/org/apache/nutch/parse/ParserJob.java
* /nutch/branches/2.x/src/java/org/apache/nutch/protocol/ProtocolStatusUtils.java
* /nutch/branches/2.x/src/java/org/apache/nutch/protocol/RobotRulesParser.java
* /nutch/branches/2.x/src/java/org/apache/nutch/storage/StorageUtils.java
* /nutch/branches/2.x/src/java/org/apache/nutch/tools/Benchmark.java
* /nutch/branches/2.x/src/java/org/apache/nutch/tools/DmozParser.java
* /nutch/branches/2.x/src/java/org/apache/nutch/tools/proxy/FakeHandler.java
* /nutch/branches/2.x/src/java/org/apache/nutch/util/HadoopFSUtil.java
* /nutch/branches/2.x/src/java/org/apache/nutch/util/LockUtil.java
* /nutch/branches/2.x/src/java/org/apache/nutch/util/NutchJob.java
* /nutch/branches/2.x/src/java/org/apache/nutch/util/TableUtil.java
* /nutch/branches/2.x/src/java/org/apache/nutch/util/domain/DomainStatistics.java
* /nutch/branches/2.x/src/test/crawl-tests.xml
* /nutch/branches/2.x/src/test/gora.properties
* /nutch/branches/2.x/src/test/nutch-site.xml
* /nutch/branches/2.x/src/test/org/apache/nutch/crawl/TestGenerator.java
* /nutch/branches/2.x/src/test/org/apache/nutch/crawl/TestInjector.java
* /nutch/branches/2.x/src/test/org/apache/nutch/fetcher/TestFetcher.java
* /nutch/branches/2.x/src/test/org/apache/nutch/metadata/TestSpellCheckedMetadata.java
* /nutch/branches/2.x/src/test/org/apache/nutch/storage/TestGoraStorage.java
* /nutch/branches/2.x/src/test/org/apache/nutch/util/CrawlTestUtil.java


> Upgrade to Gora 0.6.1
> -
>
> Key: NUTCH-1946
> URL: https://issues.apache.org/jira/browse/NUTCH-1946
> Project: Nutch
>  Issue Type: Improvement
>  Components: storage
>Affects Versions: 2.3.1
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Critical
> Fix For: 2.3.1
>
> Attachments: NUTCH-1946.patch, NUTCH-1946_Gora_fixes.patch, 
> NUTCH-1946v2.patch, NUTCH-1946v3.patch, NUTCH-1946v4.patch
>
>
> Apache Gora 0.6.1 was released recently.
> We should upgrade before pushing Nutch 2.3.1 as it will come in very handy 
> for the new Docker containers.





Build failed in Jenkins: Nutch-nutchgora #1537

2015-09-20 Thread Apache Jenkins Server
See 

Changes:

[lewismc] NUTCH-2050 Upgrade HBase and Hadoop versioning on 2.X HBase Docker

[lewismc] NUTCH-1946 Upgrade to Gora 0.6.1

--
[...truncated 3357 lines...]

init:
[mkdir] Created dir: 

[mkdir] Created dir: 

[mkdir] Created dir: 

[mkdir] Created dir: 


init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: urlfilter-validator
[javac] Compiling 1 source file to 


jar:
  [jar] Building jar: 


deps-test:

deploy:
 [copy] Copying 1 file to 


copy-generated-lib:
 [copy] Copying 1 file to 


init:
[mkdir] Created dir: 

[mkdir] Created dir: 

[mkdir] Created dir: 

[mkdir] Created dir: 


init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: urlnormalizer-basic
[javac] Compiling 2 source files to 

[javac] Creating empty 


jar:
  [jar] Building jar: 


deps-test:

deploy:
 [copy] Copying 1 file to 


copy-generated-lib:
 [copy] Copying 1 file to 


init:
[mkdir] Created dir: 

[mkdir] Created dir: 

[mkdir] Created dir: 

[mkdir] Created dir: 


init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: urlnormalizer-pass
[javac] Compiling 2 source files to 

[javac] Creating empty 


jar:
  [jar] Building jar: 


deps-test:

deploy:
 [copy] Copying 1 file to 


copy-generated-lib:
 [copy] Copying 1 file to 

[mkdir] Created dir: 

 [copy] Copying 4 files to 


init:
[mkdir] Created dir: 

[mkdir] Created dir: 


init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
   

[jira] [Commented] (NUTCH-2018) Ensure that the Docker containers for Nutch 2.X are part of the Release Management Documentation

2015-09-20 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14899916#comment-14899916
 ] 

Lewis John McGibbney commented on NUTCH-2018:
-

I'll deal with this tomorrow folks.

> Ensure that the Docker containers for Nutch 2.X are part of the Release 
> Management Documentation
> 
>
> Key: NUTCH-2018
> URL: https://issues.apache.org/jira/browse/NUTCH-2018
> Project: Nutch
>  Issue Type: Bug
>  Components: docker, documentation
>Affects Versions: 2.3.1
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Critical
> Fix For: 2.3.1
>
>
> We need to ensure that the new docker containers which live within 
> [https://github.com/apache/nutch/tree/2.x/docker|the docker package] are 
> functional and working when making releases. This means documenting how the 
> code should be updated prior to a release. This work is essential to keep 
> them working. 





[jira] [Resolved] (NUTCH-2028) java.lang.IllegalArgumentException: can't serialize class org.apache.avro.util.Utf8

2015-09-20 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-2028.
-
Resolution: Fixed

This issue has been resolved in upgrade to Gora 0.6.1.

> java.lang.IllegalArgumentException: can't serialize class 
> org.apache.avro.util.Utf8
> ---
>
> Key: NUTCH-2028
> URL: https://issues.apache.org/jira/browse/NUTCH-2028
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 2.3
> Environment: Mac 10.10.3 Nutch 2.3
>Reporter: Roman P
> Fix For: 2.3.1
>
>
> Compiled Nutch 2.3 with MongoDB as a persistence. Getting exception when 
> fetching. Searched for similar errors online, noticed that this issue was 
> addressed in gora 0.6. Tried recompiling with 0.6 but then getting different 
> exception, seems that it's incompatible with hadoop 1.2.0. Tried different 
> versions of hadoop with no luck.
> FetcherJob: starting at 2015-05-31 09:29:04
> FetcherJob: batchId: all
> FetcherJob: threads: 10
> FetcherJob: parsing: false
> FetcherJob: resuming: false
> FetcherJob : timelimit set for : -1
> java.lang.IllegalArgumentException: can't serialize class org.apache.avro.util.Utf8
>   at org.bson.BasicBSONEncoder._putObjectField(BasicBSONEncoder.java:284)
>   at org.bson.BasicBSONEncoder.putObject(BasicBSONEncoder.java:185)
>   at org.bson.BasicBSONEncoder.putObject(BasicBSONEncoder.java:131)
>   at com.mongodb.DefaultDBEncoder.writeObject(DefaultDBEncoder.java:33)
>   at com.mongodb.OutMessage.putObject(OutMessage.java:289)
>   at com.mongodb.OutMessage.writeQuery(OutMessage.java:211)
>   at com.mongodb.OutMessage.query(OutMessage.java:86)
>   at com.mongodb.DBCollectionImpl.find(DBCollectionImpl.java:81)
>   at com.mongodb.DBCollectionImpl.find(DBCollectionImpl.java:66)
>   at com.mongodb.DBCursor._check(DBCursor.java:458)
>   at com.mongodb.DBCursor._hasNext(DBCursor.java:546)
>   at com.mongodb.DBCursor.hasNext(DBCursor.java:571)
>   at org.apache.gora.mongodb.query.MongoDBResult.nextInner(MongoDBResult.java:69)
>   at org.apache.gora.query.impl.ResultBase.next(ResultBase.java:114)
>   at org.apache.gora.mapreduce.GoraRecordReader.nextKeyValue(GoraRecordReader.java:119)
>   at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:531)
>   at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
>   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
>   at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)





[jira] [Commented] (NUTCH-2105) Update Nutch Cassandra Dockerfile to work with Gora Nutch 2.3.1

2015-09-20 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14899915#comment-14899915
 ] 

Lewis John McGibbney commented on NUTCH-2105:
-

I will work on this tomorrow then push an RC for 2.3.1 folks.

> Update Nutch Cassandra Dockerfile to work with Gora Nutch 2.3.1
> ---
>
> Key: NUTCH-2105
> URL: https://issues.apache.org/jira/browse/NUTCH-2105
> Project: Nutch
>  Issue Type: New Feature
>  Components: docker
>Affects Versions: 2.3.1
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 2.3.1
>
>
> Since we are updating NUTCH-2050 it would be excellent to have the Nutch + 
> Hadoop + Gora + Cassandra stack up-to-date and ready to use as part of the 
> 2.3.1 release. This issue should review the Dockerfile and update it where 
> necessary.





[jira] [Resolved] (NUTCH-2050) Upgrade HBase and Hadoop versioning on 2.X HBase Docker

2015-09-20 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-2050.
-
Resolution: Fixed

Committed @revision 1704129 in 2.X HEAD

> Upgrade HBase and Hadoop versioning on 2.X HBase Docker 
> 
>
> Key: NUTCH-2050
> URL: https://issues.apache.org/jira/browse/NUTCH-2050
> Project: Nutch
>  Issue Type: Improvement
>  Components: docker
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 2.3.1
>
> Attachments: NUTCH-2050.patch
>
>
> We are working on old versioning.
> Let's sort this out.
> 2.X works perfectly with Hadoop 2.X.





[jira] [Resolved] (NUTCH-1572) Nutch 2.x should use o.a.g.mem.store.MemStore for testing

2015-09-20 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-1572.
-
Resolution: Fixed

Resolved with NUTCH-1946

> Nutch 2.x should use o.a.g.mem.store.MemStore for testing
> -
>
> Key: NUTCH-1572
> URL: https://issues.apache.org/jira/browse/NUTCH-1572
> Project: Nutch
>  Issue Type: Bug
>  Components: test
>Affects Versions: 2.2
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 2.3.1
>
>
> As far as I am aware, there is really no need to be setting up and using 
> hsqldb resources for testing Nutch + Gora functionality for our tests when 
> there is a MemStore available in Gora.
> In particular TestInjector, TestGenerator, TestFetcher and TestGoraStorage 
> all use gora-sql-incubating-0.1.1 and subsequently an HSQLDB server for 
> tests... this is pretty unnecessary.
> It also happens that, as of Gora 0.3, the above gora-sql artifact is 
> deprecated indefinitely.





[jira] [Updated] (NUTCH-2050) Upgrade HBase and Hadoop versioning on 2.X HBase Docker

2015-09-20 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2050:

Summary: Upgrade HBase and Hadoop versioning on 2.X HBase Docker   (was: 
Upgrade HBase and Hadoop versioning on 2.X Docker )

> Upgrade HBase and Hadoop versioning on 2.X HBase Docker 
> 
>
> Key: NUTCH-2050
> URL: https://issues.apache.org/jira/browse/NUTCH-2050
> Project: Nutch
>  Issue Type: Improvement
>  Components: docker
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 2.3.1
>
> Attachments: NUTCH-2050.patch
>
>
> We are working on old versioning.
> Let's sort this out.
> 2.X works perfectly with Hadoop 2.X.





[jira] [Commented] (NUTCH-2110) Create the capability to provide seeds in the form of "url+xpath(including option to enter seach terms).selenium"

2015-09-20 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14899934#comment-14899934
 ] 

Sebastian Nagel commented on NUTCH-2110:


Hi Asitang, the Injector is already able to store key-value pairs from the seed 
list in the CrawlDb within the CrawlDatum's metadata, see 
[[1|http://nutch.apache.org/apidocs/apidocs-1.10/org/apache/nutch/crawl/Injector.html]].
 If the XPath statements are not too complex, this would be the easiest way: 
the protocol plugin could then read the XPath from the CrawlDatum.
Regarding the "state of a selenium operation": should a state be passed to 
the outlinks of a page, or is the same page fetched multiple times with varying 
Ajax/JavaScript actions to be performed?
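
To make the seed-list idea concrete, here is a minimal Python sketch of how a seed line carrying an XPath could be split into a URL and its metadata. It assumes the tab-separated "key=value" layout described in the Injector documentation linked above. The function name parse_seed_line and the metadata keys "xpath" and "searchTerm" are hypothetical and only for illustration; this is not the Injector's own code.

```python
def parse_seed_line(line):
    """Split one seed-list line into (url, metadata dict).

    Assumes tab-separated fields: the URL first, then key=value pairs.
    """
    fields = line.rstrip("\n").split("\t")
    url = fields[0].strip()
    metadata = {}
    for field in fields[1:]:
        if "=" not in field:
            continue  # skip malformed fields instead of guessing
        # Split on the first "=" only, so an XPath predicate such as
        # [@name='q'] stays intact in the value.
        key, value = field.split("=", 1)
        metadata[key.strip()] = value.strip()
    return url, metadata


# "xpath" and "searchTerm" are hypothetical keys, not Nutch built-ins.
seed = "http://example.com/search\txpath=//input[@name='q']\tsearchTerm=nutch"
url, meta = parse_seed_line(seed)
print(url)
print(meta)
```

An XPath containing a literal tab would break this layout, which is one reason the approach fits best when the XPath statements are simple.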

> Create the capability to provide seeds in the form of "url+xpath(including 
> option to enter seach terms).selenium" 
> --
>
> Key: NUTCH-2110
> URL: https://issues.apache.org/jira/browse/NUTCH-2110
> Project: Nutch
>  Issue Type: Sub-task
>  Components: fetcher
>Affects Versions: 1.10
>Reporter: Asitang Mishra
>  Labels: memex
>
> Create the capability to provide seeds in the form of "url+xpath(including 
> option to enter seach terms).selenium" to be used by selenium 
> protocols/plugins as urls/flow to reach to a specific ajax based page or save 
> the state of a selenium operation for the next fetching round.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Questions regarding CS-572 assignment 1

2015-09-20 Thread Mattmann, Chris A (3980)
Hi Charan,

Thanks for your questions. Please copy your emails to
dev@nutch.apache.org and subscribe there, as I believe you
will find more help there.

Here are the answers:

-Original Message-

From: Charan Shampur 
Date: Sunday, September 20, 2015 at 3:55 PM
To: jpluser 
Subject: Questions regarding CS-572 assignment 1

>Hello professor,
>
>
>Sorry to interrupt you. I have had a few questions on my mind for the
>last 2 days.
>Here they are:
>
>
>1) I was unable to find any guidelines for using nutchpy to extract data
>from the crawldb. Can you provide me with some pointers to resources that
>will help?
>

The README.md on nutchpy explains how to use it to read Sequence Files:

https://github.com/ContinuumIO/nutchpy/#running


Then, if you look up the Nutch Sequence File format:

http://wiki.apache.org/nutch/NutchFileFormats


You should be good.

>
>2) How do we read or understand the data extracted by Nutch? I was able
>to collect the list of URLs that were crawled by running the readdb
>command.
>For the other data, how do we do it?

You read the data out of the Nutch DB using NutchPy. So, in fact, readDB is
a great tool (there are also tools to read the LinkDB), but you need to
write a program using NutchPy.

>
>
>3) Is there any API or command that interacts with the Nutch crawldb to get
>statistical data (MIME type, HTTP response, unfetched URLs, etc.)?

Yep, the data is stored in the Nutch data file formats specified
and linked above.
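
As a rough illustration of collecting such statistics, the Python sketch below tallies fetch statuses from a plain-text CrawlDb dump. It assumes each record in the dump carries a line of the form "Status: 2 (db_fetched)"; the sample text is made up for the example, and count_statuses is a hypothetical helper, not part of Nutch or nutchpy.

```python
import re
from collections import Counter

# Matches lines such as "Status: 2 (db_fetched)" (assumed dump layout).
STATUS_LINE = re.compile(r"^Status:\s*\d+\s*\((\w+)\)")


def count_statuses(dump_text):
    """Count how many records fall under each status name."""
    counts = Counter()
    for line in dump_text.splitlines():
        match = STATUS_LINE.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts


# Made-up sample records, only for illustration.
SAMPLE_DUMP = """\
http://example.com/\tVersion: 7
Status: 2 (db_fetched)
Fetch time: Sun Sep 20 10:00:00 PDT 2015

http://example.com/a\tVersion: 7
Status: 1 (db_unfetched)

http://example.com/b\tVersion: 7
Status: 2 (db_fetched)
"""

counts = count_statuses(SAMPLE_DUMP)
for status, n in sorted(counts.items()):
    print(status, n)
```

The same loop could be extended to other per-record fields, such as a content type stored in the metadata, if the dump includes them.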

>
>
>I have been reading through the nutch/wiki and was unable to figure it
>out.
>
>
>
>Professor, kindly help me in resolving these...
>
>
>Thanks,
>Charan

HTH.

Cheers,
Chris

+
Chris Mattmann, Ph.D.
Adjunct Associate Professor, Computer Science Department
University of Southern California
Los Angeles, CA 90089 USA
Email: mattm...@usc.edu
WWW: http://sunset.usc.edu/
+