Re: question about ObjectCache

2012-04-10 Thread Andrzej Bialecki

On 10/04/2012 05:00, Xiaolong Yang wrote:

Hi,all

I'm reading the source code of Nutch and I'm puzzled about
ObjectCache.java in the package org.apache.nutch.util. It seems to
provide only a little benefit when used in the URL normalizers and URL
filters. I have also read some discussion about caching in NUTCH-169
and NUTCH-501, but I can't understand it.

Can anyone tell me where ObjectCache is used to good benefit in
Nutch?


ObjectCache is designed to cache ready-to-use instances of Nutch 
plugins. The process of finding, instantiating and initializing plugins 
is inefficient, because it involves parsing plugin descriptors, 
initializing plugins, collecting the ones that implement correct 
extension points, etc.


It would kill performance if this process were invoked each time you 
want to run all plugins of a given type (e.g. URLNormalizer-s). The 
facades URLNormalizers, URLFilters and others make sure that plugin 
instances of a given type are initialized once per lifetime of a JVM 
and then cached in ObjectCache, so that the next time you want to use 
them they can be retrieved from the cache instead of going through the 
parsing/instantiating/initializing process again.
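To illustrate the idea, here is a minimal, hypothetical sketch of such a per-configuration cache. This is not the real org.apache.nutch.util.ObjectCache, only the pattern it implements: expensive-to-build objects are created once per configuration and then looked up by key.

```java
import java.util.Map;
import java.util.WeakHashMap;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a per-configuration object cache (not the real
// org.apache.nutch.util.ObjectCache). Each configuration object gets its
// own cache; initialized plugin instances are stored under a string key,
// so the expensive find/instantiate/initialize step runs once per JVM.
class ObjectCacheSketch {
  private static final Map<Object, ObjectCacheSketch> CACHES = new WeakHashMap<>();
  private final Map<String, Object> objects = new ConcurrentHashMap<>();

  // One cache per configuration; the WeakHashMap lets a cache be collected
  // together with its configuration object.
  public static synchronized ObjectCacheSketch get(Object conf) {
    return CACHES.computeIfAbsent(conf, c -> new ObjectCacheSketch());
  }

  public Object getObject(String key) { return objects.get(key); }

  public void setObject(String key, Object value) { objects.put(key, value); }
}
```

A facade like URLNormalizers would first try getObject(key) and only fall back to the expensive plugin-repository lookup on a cache miss.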


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Drawing an analogy between AdaptiveFetchSchedule and AdaptiveCrawlDelay

2012-03-02 Thread Andrzej Bialecki

On 02/03/2012 12:45, Lewis John Mcgibbney wrote:

Hi Guys,

As there were some comments on the user list, I recently got digging
into HTTP redirects and stumbled across NUTCH-1042. Although redirects
and crawl delays are individual issues, I think they are certainly
linked. What is interesting is that users usually don't consider them
to be interlinked, and therefore struggle to debug how and why either
the redirected or the crawl-delayed pages are not being fetched.

Doing some more digging I found the now rather old and tatty NUTCH-475,
which got me thinking about how we maintain the AdaptiveFetchSchedule
for custom refetching. I'm now starting to think about the following:

- Regardless of whether we implement an AdaptiveCrawlDelay, NUTCH-1042
still needs to be fixed, as this is obviously becoming a bit of a pain
for some users.


Yes.


- Can someone shed some light on what happened to the Fetcher2.java that
Dogacan refers to? I was only ever accustomed to OldFetcher and Fetcher :0)


Fetcher2 is the current Fetcher. The original Fetcher was temporarily 
renamed OldFetcher and then removed.





[jira] [Commented] (NUTCH-1201) Allow for different FetcherThread impls

2012-01-17 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13187927#comment-13187927
 ] 

Andrzej Bialecki  commented on NUTCH-1201:
--

I agree that there are situations where you might want a custom fetcher (e.g. 
depth-first crawling), and it would be good to come up with some more specific 
API than just MapRunner.

I'm not convinced yet that providing interfaces (or rather abstract classes) 
for the existing plumbing in Fetcher is a good idea - let's figure out first 
whether this code is reusable at all for some other fetching strategies, 
because if it's not then providing custom queue impls. may offer little value, 
and perhaps customization should be implemented on a different level.

Re. thread spinning - I haven't yet seen an unequivocal case proving 
that crawl contention is caused by the thread management in Fetcher. 
Usually, on closer look, the bottleneck turned out to lie elsewhere 
(network I/O, remote throttling, DNS lookups, politeness rules, etc.).

 Allow for different FetcherThread impls
 ---

 Key: NUTCH-1201
 URL: https://issues.apache.org/jira/browse/NUTCH-1201
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5


 For certain cases we need to modify parts of FetcherThread and make it 
 pluggable. This introduces a new config directive, fetcher.impl, that takes a 
 FQCN; Fetcher.fetch uses that setting to load the class passed to 
 job.setMapRunnerClass(). This new class has to extend Fetcher and an inner 
 class FetcherThread. This allows for overriding methods in FetcherThread, but 
 also methods in Fetcher itself if required.
 A follow-up on this issue would be to refactor parts of FetcherThread to make 
 it easier to override small sections instead of copying the entire method 
 body for a small change, which is now the case.
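A minimal sketch of the FQCN-loading step described above (the names here are illustrative; the actual patch wires the loaded class into job.setMapRunnerClass()):

```java
// Illustrative sketch of loading a fetcher implementation by its fully
// qualified class name, as the proposed fetcher.impl directive would do.
// The base-class check guards against configuring an unrelated class.
class ImplLoader {
  public static Class<?> load(String fqcn, Class<?> mustExtend)
      throws ClassNotFoundException {
    Class<?> c = Class.forName(fqcn);
    if (!mustExtend.isAssignableFrom(c)) {
      throw new IllegalArgumentException(
          fqcn + " does not extend " + mustExtend.getName());
    }
    return c;
  }
}
```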

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1247) CrawlDatum.retries should be int

2012-01-14 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186212#comment-13186212
 ] 

Andrzej Bialecki  commented on NUTCH-1247:
--

Indeed, line 264 increases the retry counter, but after it reaches retryMax 
then page status is set to DB_GONE, so it won't be generated again until it 
expires, and its retry counter won't increase. Once it expires then Generator 
should invoke FetchSchedule.forceRefetch on this page, and the default 
implementation resets the retry counter. So either there's some bug in this 
cycle, or your retryMax is greater than 127.
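The wrap-around behind those negative retry counts can be seen with a two-line sketch: a Java byte overflows past 127, so a counter stored as a byte goes negative once retryMax allows it that high.

```java
// Demonstrates why a byte-sized retry counter "goes bad with larger
// values": incrementing past Byte.MAX_VALUE wraps around to negative.
class RetryOverflow {
  public static byte increment(byte retries) {
    return (byte) (retries + 1); // arithmetic promotes to int, the cast wraps
  }
}
```

increment((byte) 127) yields -128, matching the "retry -128" lines in the CrawlDbReader output.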

 CrawlDatum.retries should be int
 

 Key: NUTCH-1247
 URL: https://issues.apache.org/jira/browse/NUTCH-1247
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Markus Jelsma
 Fix For: 1.5


 CrawlDatum.retries is a byte and goes bad with larger values.
 12/01/12 18:35:22 INFO crawl.CrawlDbReader: retry -127: 1
 12/01/12 18:35:22 INFO crawl.CrawlDbReader: retry -128: 1





[jira] [Commented] (NUTCH-1247) CrawlDatum.retries should be int

2012-01-13 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13185908#comment-13185908
 ] 

Andrzej Bialecki  commented on NUTCH-1247:
--

Originally the reason for a byte was compactness, but we can get the same 
effect using vint.

Markus, something seems off in your setup if you get such high values of 
retries ... usually CrawlDbReducer will set STATUS_DB_GONE if the number of 
retries reaches db.fetch.retry.max, so the page will not be tried again until 
FetchSchedule.forceRefetch resets its status (and the number of retries).
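The "same effect using vint" remark can be illustrated with a simple variable-length integer encoding. This is a generic LEB128-style sketch, not Hadoop's exact WritableUtils format: small values still take a single byte, but large counts no longer overflow.

```java
import java.io.ByteArrayOutputStream;

// Generic variable-length int encoding sketch: 7 data bits per byte,
// high bit marks continuation. Small values stay as compact as a byte
// field, but the full int range is representable.
class VarInt {
  public static byte[] encode(int value) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int v = value;
    do {
      int b = v & 0x7f;
      v >>>= 7;
      out.write(v != 0 ? (b | 0x80) : b); // set high bit if more bytes follow
    } while (v != 0);
    return out.toByteArray();
  }

  public static int decode(byte[] bytes) {
    int result = 0, shift = 0;
    for (byte b : bytes) {
      result |= (b & 0x7f) << shift;
      shift += 7;
      if ((b & 0x80) == 0) break; // last byte reached
    }
    return result;
  }
}
```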

 CrawlDatum.retries should be int
 

 Key: NUTCH-1247
 URL: https://issues.apache.org/jira/browse/NUTCH-1247
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Markus Jelsma
 Fix For: 1.5


 CrawlDatum.retries is a byte and goes bad with larger values.
 12/01/12 18:35:22 INFO crawl.CrawlDbReader: retry -127: 1
 12/01/12 18:35:22 INFO crawl.CrawlDbReader: retry -128: 1





Re: Build failed in Jenkins: Nutch-trunk #1706

2011-12-28 Thread Andrzej Bialecki

On 28/12/2011 12:00, Lewis John Mcgibbney wrote:

Hi Guys,

Pretty strange compilation failure: this test class hasn't been touched
in months, and on the surface, having looked at the test case, there
appears to be no obvious reason for it failing to compile. I've
kick-started another build on Jenkins to see if it will resolve itself.


I don't think it will - I can reproduce this failure locally. Here's 
what fixed the failure for me (I'm pretty ignorant about ivy/maven so 
there's likely a more correct fix for this):


Index: ivy/ivy.xml
===================================================================
--- ivy/ivy.xml (revision 1225046)
+++ ivy/ivy.xml (working copy)
@@ -69,7 +69,7 @@
 <!--Configuration: test -->

 <!--artifacts needed for testing -->
-<dependency org="junit" name="junit" rev="3.8.1" conf="test->default" />
+<dependency org="junit" name="junit" rev="3.8.1" conf="*->default" />
 <dependency org="org.apache.hadoop" name="hadoop-test" rev="0.20.205.0" conf="test->default" />





Re: Build failed in Jenkins: Nutch-trunk #1706

2011-12-28 Thread Andrzej Bialecki

On 28/12/2011 14:15, Lewis John Mcgibbney wrote:

Hi Andrzej,

Can anyone confirm? I've tried this patch locally and although I
couldn't reproduce the original issue, it seems to be working fine for
me as well.


Check your lib/ dir - maybe you have a local copy of the junit jar that 
gets pulled onto the classpath and masks the issue? This happened to me 
once or twice...






Re: [jira] [Created] (NUTCH-1225) Migrate CrawlDBScanner to MapReduce API

2011-12-15 Thread Andrzej Bialecki

On 15/12/2011 13:13, Markus Jelsma wrote:

Hmm, I don't see how I can use the old mapred MapFileOutputFormat API with the new
Job API. job.setOutputFormatClass(MapFileOutputFormat.class) expects the
mapreduce.lib.output.MapFileOutputFormat class and won't accept the old API:

setOutputFormatClass(java.lang.Class<? extends
org.apache.hadoop.mapreduce.OutputFormat>) in org.apache.hadoop.mapreduce.Job
cannot be applied to
(java.lang.Class<org.apache.hadoop.mapred.MapFileOutputFormat>)

In short, I don't know how I can migrate jobs to the new API on 0.20.x without
having MapFileOutputFormat present in the new API. Trying to set the old
MapFileOutputFormat ...


Ah, no, that's not what I meant ... of course you need to change the 
code to use the new API, and the new code will look quite different :) 
My point was only that it is different in a consistent way, so after 
you've ported one or two classes the other ones are easy to convert, too...


I'm bogged with other work now, but I'll see if I can prepare an example 
later today...





Re: [jira] [Created] (NUTCH-1225) Migrate CrawlDBScanner to MapReduce API

2011-12-14 Thread Andrzej Bialecki

On 14/12/2011 16:01, Markus Jelsma wrote:

This is highly annoying, MapFileOutputFormat is not present in the MapReduce
API until 0.21!


AFAIK that's not the case ... there is both an old API and a new API 
implementation (the old one is deprecated). The new API is in 
org.apache.hadoop.mapreduce.lib.output .





Re: [jira] [Created] (NUTCH-1225) Migrate CrawlDBScanner to MapReduce API

2011-12-14 Thread Andrzej Bialecki

On 14/12/2011 18:30, Markus Jelsma wrote:

proper link:

http://hadoop.apache.org/common/docs/r0.20.205.0/api/org/apache/hadoop/mapreduce/lib/output/package-summary.html


I thought the goal was to upgrade to 0.22, where this class is present. 
In 0.20.205 org.apache.hadoop.mapred.MapFileOutputFormat still uses the 
old api, and it's not deprecated yet.






Re: Upgrading to Hadoop 0.22.0+

2011-12-13 Thread Andrzej Bialecki

On 13/12/2011 17:42, Lewis John Mcgibbney wrote:

Hi Markus,

I'm certainly in agreement here. If you like to open a Jira, we can
begin the build up a picture of what is required.

Lewis

On Tue, Dec 13, 2011 at 4:41 PM, Markus Jelsma
markus.jel...@openindex.io  wrote:

Hi,

To keep up with the rest of the world I believe we should move from the old
Hadoop mapred API to the new MapReduce API, which has already been done for
the nutchgora branch. Upgrading from hadoop-core to hadoop-common is easily
done in Ivy, but all jobs must be tackled and we have many jobs!

Anyone willing to give pointers and a helping hand in this large task?


I guess the question is also whether 0.22 is compatible enough to 
compile more or less with the existing code that uses the old API. If it 
does, then we can do the transition gradually; if it doesn't, then it's 
a bigger issue.


This is easy to verify - just drop in the 0.22 jars and see if it 
compiles / tests are passing.





Re: Upgrading to Hadoop 0.22.0+

2011-12-13 Thread Andrzej Bialecki

On 13/12/2011 18:04, Markus Jelsma wrote:

Hi

I did a quick test to see what happens, and it won't compile: it cannot find
our old mapred APIs in 0.22. I've also tried 0.20.205.0, which compiles but
won't run, and many tests fail with stuff like:

Exception in thread main java.lang.NoClassDefFoundError:
org/codehaus/jackson/map/JsonMappingException
 at
org.apache.nutch.util.dupedb.HostDeduplicator.deduplicator(HostDeduplicator.java:421)


Hmm... what's that? I don't see this class (or this package) in the 
Nutch tree. Also, trunk doesn't use JSON for anything as far as I know.



 at
org.apache.nutch.util.dupedb.HostDeduplicator.run(HostDeduplicator.java:443)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at
org.apache.nutch.util.dupedb.HostDeduplicator.main(HostDeduplicator.java:431)
Caused by: java.lang.ClassNotFoundException:
org.codehaus.jackson.map.JsonMappingException
 at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
 ... 4 more

I think this can be overcome, but we cannot hide from the fact that all jobs
must be ported to the new API at some point.

You did some work on the new APIs - did you come across any cumbersome issues
when working on them?


It was quite some time ago .. but I don't remember anything being really 
complicated, it was just tedious - and once you've done one class the 
other classes follow roughly the same pattern.






[jira] [Resolved] (NUTCH-1213) Pass additional SolrParams when indexing to Solr

2011-11-28 Thread Andrzej Bialecki (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-1213.
--

Resolution: Fixed

Committed in rev. 1207217, thanks for the review.

 Pass additional SolrParams when indexing to Solr
 

 Key: NUTCH-1213
 URL: https://issues.apache.org/jira/browse/NUTCH-1213
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: NUTCH-1213.diff


 This is a simple improvement of the SolrIndexer. It adds the ability to pass 
 additional Solr parameters that are applied to each UpdateRequest. This is 
 useful when you have to pass parameters specific to a particular indexing 
 run, which are not in Solr invariants for the update handler, and modifying 
 the Solr configuration for each different indexing run is inconvenient.





[jira] [Updated] (NUTCH-1213) Pass additional SolrParams when indexing to Solr

2011-11-25 Thread Andrzej Bialecki (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-1213:
-

Attachment: NUTCH-1213.diff

Patch that implements this functionality. SolrParams can be passed as a 
URL-like string, for example:
{code}
nutch solrindex http://localhost:8983/solr/collection1 db -linkdb linkdb 
-params update.chain=distrib&fmap.a=links segments/2025105233
{code}
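Assuming the archive stripped an '&' from the example above (i.e. the params string is update.chain=distrib&fmap.a=links), here is a hypothetical sketch of parsing such a URL-like parameter string; the real SolrIndexer code may split it differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: split a URL-query-like "-params" argument such as
// "update.chain=distrib&fmap.a=links" into key/value pairs that could be
// applied to each Solr update request.
class ParamsParser {
  public static Map<String, String> parse(String params) {
    Map<String, String> result = new LinkedHashMap<>();
    for (String pair : params.split("&")) {
      int eq = pair.indexOf('=');
      if (eq < 0) {
        result.put(pair, ""); // bare key without a value
      } else {
        result.put(pair.substring(0, eq), pair.substring(eq + 1));
      }
    }
    return result;
  }
}
```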

 Pass additional SolrParams when indexing to Solr
 

 Key: NUTCH-1213
 URL: https://issues.apache.org/jira/browse/NUTCH-1213
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: NUTCH-1213.diff


 This is a simple improvement of the SolrIndexer. It adds the ability to pass 
 additional Solr parameters that are applied to each UpdateRequest. This is 
 useful when you have to pass parameters specific to a particular indexing 
 run, which are not in Solr invariants for the update handler, and modifying 
 the Solr configuration for each different indexing run is inconvenient.





[jira] [Issue Comment Edited] (NUTCH-1213) Pass additional SolrParams when indexing to Solr

2011-11-25 Thread Andrzej Bialecki (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13157077#comment-13157077
 ] 

Andrzej Bialecki  edited comment on NUTCH-1213 at 11/25/11 10:26 AM:
-

Patch that implements this functionality. SolrParams can be passed as a 
URL-like string, for example:
{code}
nutch solrindex http://localhost:8983/solr/collection1 db -linkdb linkdb 
-params update.chain=distrib&fmap.a=links segments/2025105233
{code}

  was (Author: ab):
Path that implements this functionality. SolrParams can be passed as an 
URL-like string, for example:
{code}
nutch solrindex http://localhost:8983/solr/collection1 db -linkdb linkdb 
-params update.chain=distrib&fmap.a=links segments/2025105233
{code}
  
 Pass additional SolrParams when indexing to Solr
 

 Key: NUTCH-1213
 URL: https://issues.apache.org/jira/browse/NUTCH-1213
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: NUTCH-1213.diff


 This is a simple improvement of the SolrIndexer. It adds the ability to pass 
 additional Solr parameters that are applied to each UpdateRequest. This is 
 useful when you have to pass parameters specific to a particular indexing 
 run, which are not in Solr invariants for the update handler, and modifying 
 the Solr configuration for each different indexing run is inconvenient.





Re: Dependency Injection

2011-11-23 Thread Andrzej Bialecki

On 23/11/2011 01:02, Andrzej Bialecki wrote:

On 22/11/2011 19:47, PJ Herring wrote:

Hey Chris,

Thanks for the response. I looked at the documents you sent me, and I
really do think incorporating some kind of DI Framework could be a great
addition to Nutch.

I have a general plan of attack, but I'll try to write that up more
formally and send it out to get some kind of feedback.


This sounds interesting. As Chris mentioned, the current plugin system
is far from ideal, but so far it worked reasonably well. The key
functionality that it implements is:

* self-discovery of services provided by each plugin,
* easy pluggability, by the virtue of dropping super-jars (jars with
impl. classes and nested library jars) to a predefined location,
* controlled classloader isolation between plugins so that incompatible
versions of libraries can be used
* but also ability to export specified classes and libraries so that one
plugin can use other plugin's exported resources on its classpath.
* optional auto-loading of dependent plugins

In the past one contributor made a bold attempt to port Nutch to OSGI,
and it turned out to be much more complicated than we expected, and with
a bigger impact on the way Nutch applications were supposed to run ...
so at that time we didn't think this complication was justified.

If we can figure out something between full-blown OSGI and the current
system then that would be great.



You may also want to take a look at JSPF (http://code.google.com/p/jspf) 
which perhaps could be made to satisfy the above requirements without 
too much refactoring.






Re: Dependency Injection

2011-11-22 Thread Andrzej Bialecki

On 22/11/2011 19:47, PJ Herring wrote:

Hey Chris,

Thanks for the response. I looked at the documents you sent me, and I
really do think incorporating some kind of DI Framework could be a great
addition to Nutch.

I have a general plan of attack, but I'll try to write that up more
formally and send it out to get some kind of feedback.


This sounds interesting. As Chris mentioned, the current plugin system 
is far from ideal, but so far it worked reasonably well. The key 
functionality that it implements is:


* self-discovery of services provided by each plugin,
* easy pluggability, by the virtue of dropping super-jars (jars with 
impl. classes and nested library jars) to a predefined location,
* controlled classloader isolation between plugins so that incompatible 
versions of libraries can be used
* but also ability to export specified classes and libraries so that one 
plugin can use other plugin's exported resources on its classpath.

* optional auto-loading of dependent plugins

In the past one contributor made a bold attempt to port Nutch to OSGI, 
and it turned out to be much more complicated than we expected, and with 
a bigger impact on the way Nutch applications were supposed to run ... 
so at that time we didn't think this complication was justified.


If we can figure out something between full-blown OSGI and the current 
system then that would be great.
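The "controlled classloader isolation" point above can be sketched with a child-first classloader. This is only an illustration of the idea, not Nutch's actual plugin classloader:

```java
import java.net.URL;
import java.net.URLClassLoader;

// Child-first delegation sketch: classes in the plugin's own jars win
// over the parent classpath, so plugins can bundle incompatible library
// versions. Core java.* classes are always delegated to the parent.
class ChildFirstClassLoader extends URLClassLoader {
  public ChildFirstClassLoader(URL[] urls, ClassLoader parent) {
    super(urls, parent);
  }

  @Override
  protected Class<?> loadClass(String name, boolean resolve)
      throws ClassNotFoundException {
    synchronized (getClassLoadingLock(name)) {
      Class<?> c = findLoadedClass(name);
      if (c == null && !name.startsWith("java.")) {
        try {
          c = findClass(name); // try the plugin's own jars first
        } catch (ClassNotFoundException ignored) {
          // fall through to the parent
        }
      }
      if (c == null) {
        c = super.loadClass(name, false); // normal parent-first fallback
      }
      if (resolve) {
        resolveClass(c);
      }
      return c;
    }
  }
}
```

Exporting classes to other plugins would then amount to putting one plugin's loader (or its jars) on another plugin's delegation path.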





Re: Signature == null ?

2011-11-15 Thread Andrzej Bialecki

On 15/11/2011 20:33, Markus Jelsma wrote:

It's back again! Last try if someone has a pointer for this.
Cheers


After some DB updates, they're gone! Does anyone recognize this phenomenon?

On Tuesday 08 November 2011 11:22:48 Markus Jelsma wrote:

On Tuesday 08 November 2011 11:15:37 Markus Jelsma wrote:

Hi guys,

I've got an M/R job selecting only DB_FETCHED and DB_NOTMODIFIED records and
their signatures. I had to add a sanity check on signature to avoid an
NPE. I had the assumption that any record with such a DB_ status has to have a
signature, right?

Why does roughly 0.0001625% of my records exist without a signature?


Now with correct metrics:
Why does roughly 0.84% of my records exist without a signature?


This could be somehow related to pages that come from redirects so that 
when they are fetched they are accounted for under different urls, which 
in turn may confuse the update code in CrawlDbReducer... Do you notice 
any pattern to these pages? What's their origin?





Re: Persistent problems with Ivy dependencies in Eclipse

2011-11-10 Thread Andrzej Bialecki

On 10/11/2011 04:39, Lewis John Mcgibbney wrote:

Gets even more strange, both SWFParser and AutomationURLFilter import
additonal depenedencies, however they are not included within thier
plugin/ivy/ivy.xml files!

Am I missing something here?


Most likely these problems come from the initial porting of a pure ant 
build to an ant+ivy build. We should determine what deps are really 
needed by these plugins, and sanitize the ivy.xml files so that they 
make sense - if the existing files can't be untangled we can ditch them 
and come up with new, clean ones.





[jira] [Commented] (NUTCH-1139) Indexer to delete documents

2011-11-10 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13147722#comment-13147722
 ] 

Andrzej Bialecki  commented on NUTCH-1139:
--

I suggest renaming the option to -deleteGone, to make it more obvious what it's 
supposed to do.

 Indexer to delete documents
 ---

 Key: NUTCH-1139
 URL: https://issues.apache.org/jira/browse/NUTCH-1139
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5

 Attachments: NUTCH-1139-1.4-1.patch


 Add an option -delete to the solrindex command. With this feature enabled 
 documents of the currently processing segment with status FETCH_GONE or 
 FETCH_REDIR_PERM are deleted, a following SolrClean is not required anymore.
 This issue is a follow up of NUTCH-1052.





[jira] [Commented] (NUTCH-1061) Migrate MoreIndexingFilter from Apache ORO to java.util.regex

2011-11-10 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13147723#comment-13147723
 ] 

Andrzej Bialecki  commented on NUTCH-1061:
--

+1.

 Migrate MoreIndexingFilter from Apache ORO to java.util.regex
 -

 Key: NUTCH-1061
 URL: https://issues.apache.org/jira/browse/NUTCH-1061
 Project: Nutch
  Issue Type: Improvement
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5

 Attachments: NUTCH-1061-1.4-1.patch


 Here's a patch migrating the resetTitle method away from Apache ORO to 
 java.util.regex. There was no unit test for this method so I added one. The 
 test passes with the old Apache ORO impl. and with the new j.u.regex impl.
 Please comment.





Re: Nutch Maven artifacts now published as polled/nightly SNAPSHOTS

2011-11-05 Thread Andrzej Bialecki

On 05/11/2011 06:44, Mattmann, Chris A (388J) wrote:

Hey Guys,

I modified the Jenkins jobs that Lewis set up to now:

* poll SCM hourly for changes to Nutch
* publish Maven snapshots (1.5-SNAPSHOT) and above of Nutch
to repository.apache.org


Very useful - thanks a lot!




[jira] [Commented] (NUTCH-1196) Update job should impose an upper limit on the number of inlinks (nutchgora)

2011-11-04 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13144226#comment-13144226
 ] 

Andrzej Bialecki  commented on NUTCH-1196:
--

Very nicely done and useful patch! A few cosmetic comments:

* a common pattern in Hadoop is to reuse object instances as much as possible, 
so any places where you use the new operator should be reviewed (e.g. new 
NutchWritable(...)).
* in UrlScoreComparator.compare(o1, o2) you can just use unary minus instead of 
multiplication by -1.
* in DbUpdateMapper you can assign a score of Float.MAX_VALUE to the web page 
record, this way in DbUpdateReducer.reduce you won't have to iterate over all 
entries, because the web page record will always come as the first, and we can 
save some time by skipping the remaining entries. Unless you really want to 
tally the number of skipped inlinks.

Overall the patch looks good, +1 for committing.

 Update job should impose an upper limit on the number of inlinks (nutchgora)
 

 Key: NUTCH-1196
 URL: https://issues.apache.org/jira/browse/NUTCH-1196
 Project: Nutch
  Issue Type: Bug
Reporter: Ferdy Galema
 Fix For: nutchgora

 Attachments: NUTCH-1196.patch


 Currently the nutchgora branch does not limit the number of inlinks in the 
 update job. This will result in some nasty out-of-memory exceptions and 
 timeouts when the crawl is getting big. Nutch trunk already has a default 
 limit of 10,000 inlinks. I will implement this in nutchgora too. Nutch trunk 
 uses a sorting mechanism in the reducer itself, but I will implement it using 
 standard Hadoop components instead (should be a bit faster). This means:
 The keys of the reducer will be a {url,score} tuple.
 *Partitioning* will be done by {url}.
 *Sorting* will be done by {url,score}.
 Finally *grouping* will be done by {url} again.
 This ensures all identical urls will be put in the same reducer, but in 
 order of scoring.
 Patch should be ready by tomorrow. Please let me know when you have any 
 comments or suggestions.
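The {url,score} composite-key ordering described above can be sketched with a plain comparator (illustrative only; the actual patch implements it as Hadoop partitioner/comparator classes):

```java
import java.util.Comparator;

// Sketch of the {url,score} sort order: ascending by url, then
// descending by score, so after grouping by url the highest-scoring
// entry for each url reaches the reducer first.
class UrlScoreKey {
  final String url;
  final float score;

  UrlScoreKey(String url, float score) {
    this.url = url;
    this.score = score;
  }

  static final Comparator<UrlScoreKey> SORT = (a, b) -> {
    int c = a.url.compareTo(b.url);
    return c != 0 ? c : Float.compare(b.score, a.score); // higher score first
  };
}
```

Partitioning and grouping would compare only the url component, so all entries for one url land in the same reduce call, already ordered by score.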





[jira] [Resolved] (NUTCH-1195) Add Solr 4x (trunk) example schema

2011-11-03 Thread Andrzej Bialecki (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-1195.
--

Resolution: Fixed

Committed in rev. 1197319.

 Add Solr 4x (trunk) example schema
 --

 Key: NUTCH-1195
 URL: https://issues.apache.org/jira/browse/NUTCH-1195
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 1.4

 Attachments: schema-solr4.xml


 The conf/schema.xml that we ship works ok for Solr 3.x, but in Solr trunk 
 some of the class names have been changed, and some field types have been 
 redefined, so if you simply drop this schema into Solr it will cause severe 
 errors and indexing won't work.
 I propose to add a version of the schema.xml file that is tailored to Solr 
 4.x so that users can deploy this schema when indexing to Solr trunk.





[jira] [Created] (NUTCH-1197) Add statically configured field values to solrindex-mapping.xml

2011-11-03 Thread Andrzej Bialecki (Created) (JIRA)
Add statically configured field values to solrindex-mapping.xml
---

 Key: NUTCH-1197
 URL: https://issues.apache.org/jira/browse/NUTCH-1197
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 1.4


In some cases it's useful to be able to add to every document sent to Solr a 
set of predefined fields with static values. This could be implemented on the 
Solr side (with a custom UpdateRequestProcessor), but it may be less cumbersome 
to add them on the Nutch side.

Example: let's say I have several Nutch configurations all indexing to the same 
Solr instance, and I want each of them to add its identifier as a field in all 
documents, e.g. origin=web_crawl_1, origin=file_crawl, 
origin=unlimited_crawl, etc...
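For illustration, a static field in conf/solrindex-mapping.xml could look roughly like the sketch below; the value attribute is an assumed syntax for the proposed feature, not necessarily what the committed patch uses:

```xml
<mapping>
  <fields>
    <!-- existing dynamic mapping: copy a parsed field to a Solr field -->
    <field dest="content" source="content"/>
    <!-- assumed static-value syntax: attach origin=web_crawl_1 to every doc -->
    <field dest="origin" value="web_crawl_1"/>
  </fields>
</mapping>
```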





[jira] [Updated] (NUTCH-1197) Add statically configured field values to solrindex-mapping.xml

2011-11-03 Thread Andrzej Bialecki (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-1197:
-

Attachment: NUTCH-1197.patch

Patch with the implementation. I added some javadocs, and a unit test for both 
the old and the new functionality.

 Add statically configured field values to solrindex-mapping.xml
 ---

 Key: NUTCH-1197
 URL: https://issues.apache.org/jira/browse/NUTCH-1197
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 1.4

 Attachments: NUTCH-1197.patch


 In some cases it's useful to be able to add to every document sent to Solr a 
 set of predefined fields with static values. This could be implemented on the 
 Solr side (with a custom UpdateRequestProcessor), but it may be less 
 cumbersome to add them on the Nutch side.
 Example: let's say I have several Nutch configurations all indexing to the 
 same Solr instance, and I want each of them to add its identifier as a field 
 in all documents, e.g. origin=web_crawl_1, origin=file_crawl, 
 origin=unlimited_crawl, etc...





[jira] [Created] (NUTCH-1195) Add Solr 4x (trunk) example schema

2011-11-02 Thread Andrzej Bialecki (Created) (JIRA)
Add Solr 4x (trunk) example schema
--

 Key: NUTCH-1195
 URL: https://issues.apache.org/jira/browse/NUTCH-1195
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 1.4


The conf/schema.xml that we ship works ok for Solr 3.x, but in Solr trunk some 
of the class names have been changed, and some field types have been redefined, 
so if you simply drop this schema into Solr it will cause severe errors and 
indexing won't work.

I propose to add a version of the schema.xml file that is tailored to Solr 4.x 
so that users can deploy this schema when indexing to Solr trunk.





[jira] [Updated] (NUTCH-1195) Add Solr 4x (trunk) example schema

2011-11-02 Thread Andrzej Bialecki (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-1195:
-

Attachment: schema-solr4.xml

 Add Solr 4x (trunk) example schema
 --

 Key: NUTCH-1195
 URL: https://issues.apache.org/jira/browse/NUTCH-1195
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 1.4

 Attachments: schema-solr4.xml


 The conf/schema.xml that we ship works ok for Solr 3.x, but in Solr trunk 
 some of the class names have been changed, and some field types have been 
 redefined, so if you simply drop this schema into Solr it will cause severe 
 errors and indexing won't work.
 I propose to add a version of the schema.xml file that is tailored to Solr 
 4.x so that users can deploy this schema when indexing to Solr trunk.





[jira] [Commented] (NUTCH-1135) Fix TestGoraStorage for Nutchgora

2011-10-14 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13127427#comment-13127427
 ] 

Andrzej Bialecki  commented on NUTCH-1135:
--

A few comments from the author of this monstrosity ;) First, thanks Ferdy for 
taking time to work with this, it's much appreciated, we need to move forward 
on this. I agree that ultimately this test should be moved to Gora and become a 
part of a larger test suite that verifies correctness of concurrent 
multi-threaded and multi-process operations.

However, the immediate purpose of this class was to stress-test the existing 
Gora versions in usage patterns typical for Nutch, in order to verify that a 
particular version of Gora is a viable storage layer for Nutch - so the test 
tries to replicate typical Nutch scenarios. Remember that this has to work not 
only for a toy crawl in a single JVM in local mode, but also for a fully 
distributed parallel map-reduce crawl. Consequently:

* testMultiThread: tests a scenario of multiple threads in a single JVM all 
writing to the same storage instance. This replicates a scenario present e.g. 
in a single Fetcher task. If this test fails (assuming it's properly 
constructed!) then this means that Gora will fail, perhaps silently (see 
NUTCH-893), in a fundamental Nutch tool.

* testMultiProcess: tests a scenario of multiple processes running in multiple 
JVMs all writing to the same storage instance. This replicates a scenario of 
multiple map-reduce tasks all using the same storage config (shared storage, 
e.g. HSQLDB in server mode), and it's fundamental to all Nutch tools running on 
a cluster. In map-reduce jobs there are usually many concurrent tasks, and some 
of them may execute in several copies in parallel (speculative execution) and 
some others may fail catastrophically without proper cleanup - and Gora 
backends must just deal with it. If this test fails (again, assuming it's 
properly constructed and doesn't exceed some OS capabilities of the test 
machine, or some known limits of a storage impl. like the number of concurrent 
connections) then it means that Gora storage is not reliable for a typical 
map-reduce usage, which sort of defeats the point of using it at all.

To summarize: I think the patch in its current form helps the tests pass, but I 
don't think it addresses the underlying problems in Gora (or perhaps the 
problems with HSQL backend), rather it hides the problem. After all, we want 
the test to mean something if it passes, to verify that we can use Gora for 
more than a toy crawl, with guarantees of correctness in presence of concurrent 
updates.

If the above errors don't indicate issues with Gora, but instead are caused by 
exceeded OS or hsql limits, or hsql misconfiguration, then of course we should 
fix the configs and adjust the numbers so that they make sense. But with the 
proper config and proper numbers both tests should pass, otherwise we can't be 
sure that Gora is working properly at all.
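The testMultiThread scenario boils down to the pattern below: many threads writing concurrently to one shared storage instance, then verifying that no writes were lost. This is a generic sketch with a ConcurrentHashMap standing in for the shared Gora DataStore; the real test of course goes through the store API and flushes/closes it.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class MultiThreadStoreTest {
    // Stand-in for a shared DataStore instance.
    static final Map<String, String> store = new ConcurrentHashMap<>();

    // Each "fetcher thread" writes a disjoint slice of keys to the same
    // store; afterwards every key must be present exactly once.
    static boolean run(int threads, int rowsPerThread) {
        store.clear();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            final int id = t;
            pool.execute(() -> {
                for (int i = 0; i < rowsPerThread; i++) {
                    store.put("http://example.com/" + id + "/" + i, "row");
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return store.size() == threads * rowsPerThread;  // no lost writes
    }

    public static void main(String[] args) {
        System.out.println(run(8, 1000));  // true if no writes were lost
    }
}
```

testMultiProcess follows the same verification idea, but with the writers spread across JVMs against a shared backend, which is exactly where speculative execution and failed tasks come into play.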

 Fix TestGoraStorage for Nutchgora
 -

 Key: NUTCH-1135
 URL: https://issues.apache.org/jira/browse/NUTCH-1135
 Project: Nutch
  Issue Type: Sub-task
  Components: storage
Affects Versions: nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Critical
 Fix For: nutchgora

 Attachments: NUTCH-1135-v1.patch, NUTCH-1135-v2.patch


 This issue is part of a larger target which aims to fix broken JUnit tests 
 for Nutchgora





[jira] [Commented] (NUTCH-1135) Fix TestGoraStorage for Nutchgora

2011-10-14 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13127470#comment-13127470
 ] 

Andrzej Bialecki  commented on NUTCH-1135:
--

bq. if you prefer to keep the old TestGoraStorage structure 

Not really, I'm not against cleanup / breaking it up - if it makes sense let's 
go for it. My main concern was that by skipping the multi-process test 
altogether we would ignore testing a part of Gora functionality that is 
critical to Nutch (well, to any other map-reduce app, too, but we're doing 
Nutch here ;) ). 

Thank you for your persistence.

bq. By the way, I tested the testMultithreaded with a DataStore that is not 
thread safe

Excellent!

 Fix TestGoraStorage for Nutchgora
 -

 Key: NUTCH-1135
 URL: https://issues.apache.org/jira/browse/NUTCH-1135
 Project: Nutch
  Issue Type: Sub-task
  Components: storage
Affects Versions: nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Critical
 Fix For: nutchgora

 Attachments: NUTCH-1135-v1.patch, NUTCH-1135-v2.patch


 This issue is part of a larger target which aims to fix broken JUnit tests 
 for Nutchgora





[jira] [Commented] (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?

2011-10-12 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13125712#comment-13125712
 ] 

Andrzej Bialecki  commented on NUTCH-797:
-

That's unexpected :) I checked the patch and I can't see where the bug could be 
... Did you make sure that your config is correct, and that the job actually 
sees the right value of this property in the config (check the job.xml via 
JobTracker)? TestDOMContentUtils indicates that it should work, so we need to 
make sure that the flag has correct value.

 parse-tika is not properly constructing URLs when the target begins with a ?
 --

 Key: NUTCH-797
 URL: https://issues.apache.org/jira/browse/NUTCH-797
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.1
 Environment: Win 7, Java(TM) SE Runtime Environment (build 
 1.6.0_16-b01)
 Also repro's on RHEL and java 1.4.2
Reporter: Robert Hohman
Assignee: Andrzej Bialecki 
Priority: Minor
 Fix For: nutchgora

 Attachments: NUTCH-797.patch, pureQueryUrl-2.patch, pureQueryUrl.patch


 This is my first bug and patch on nutch, so apologies if I have not provided 
 enough detail.
 In crawling the page at 
 http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0 there are 
 links in the page that look like this:
 <a href="?co=0&sk=0&p=2&pi=1">2</a></td><td><a 
 href="?co=0&sk=0&p=3&pi=1">3</a>
 in org.apache.nutch.parse.tika.DOMContentUtils rev 916362 (trunk), as 
 getOutlinks looks for links, it comes across this link, and constructs a new 
 url with a base URL class built from 
 "http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0", and a 
 target of "?co=0&sk=0&p=2&pi=1"
 The URL class, per RFC 3986 at 
 http://labs.apache.org/webarch/uri/rfc/rfc3986.html#relative-merge, defines 
 how to merge these two, and per the RFC, the URL class merges these to: 
 http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&p=2&pi=1
 because the RFC explicitly states that the rightmost url segment (the 
 Search.aspx in this case) should be ripped off before combining.
 While this is compliant with the RFC, it means the URLs which are created for 
 the next round of fetching are incorrect.  Modern browsers seem to handle 
 this case (I checked IE8 and Firefox 3.5), so I'm guessing this is an obscure 
 exception or handling of what is a poorly formed url on accenture's part.
 I have fixed this by modifying DOMContentUtils to look for the case where a ? 
 begins the target, and then pulling the rightmost component out of the base 
 and inserting it into the target before the ?, so the target in this example 
 becomes:
 Search.aspx?co=0&sk=0&p=2&pi=1
 The URL class then properly constructs the new url as:
 http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0&p=2&pi=1
 If it is agreed that this solution works, I believe the other html parsers in 
 nutch would need to be modified in a similar way.
 Can I get feedback on this proposed solution?  Specifically I'm worried about 
 unforeseen side effects.
 Much thanks
 Here is the patch info:
 Index: 
 src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
 ===
 --- 
 src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
(revision 916362)
 +++ 
 src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
(working copy)
 @@ -299,6 +299,50 @@
  return false;
}

 +  private URL fixURL(URL base, String target) throws MalformedURLException
 +  {
 +   // handle params that are embedded into the base url - move them to target
 +   // so URL class constructs the new url class properly
 +   if (base.toString().indexOf(';') > 0)
 +  return fixEmbeddedParams(base, target);
 +
 +   // handle the case that there is a target that is a pure query.
 +   // Strictly speaking this is a violation of RFC 2396 section 5.2.2 on how to assemble
 +   // URLs but I've seen this in numerous places, for example at
 +   // http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0
 +   // It has urls in the page of the form href="?co=0&sk=0&pg=1", and by default
 +   // URL constructs the base+target combo as
 +   // http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&pg=1, incorrectly
 +   // dropping the Search.aspx target
 +   //
 +   // Browsers handle these just fine, they must have an exception similar to this
 +   if (target.startsWith("?"))
 +   {
 +   return fixPureQueryTargets(base, target);
 +   }
 +
 +   return new URL(base, target);
 +  }
 +
 +  private URL fixPureQueryTargets(URL base, String target

Re: [jira] [Commented] (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?

2011-10-12 Thread Andrzej Bialecki

On 12/10/2011 13:17, Markus Jelsma (Commented) (JIRA) wrote:


 [ 
https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13125717#comment-13125717
 ]

Markus Jelsma commented on NUTCH-797:
-

This test was on a local instance. I tried both values for 
parser.fix.embeddedparams with:
$ bin/nutch parsechecker http://www.funkybabes.nl/;ROOOWAN/fotoboek


Is this how it should be implemented? I'm not sure. Embedded params are a bit 
puzzling :)


Hmm ... if that's the exact command-line expression that you entered 
then if you are using a *nix shell the semicolon would mean the end of 
command, so in fact what was executed would be:


$ bin/nutch parsechecker http://www.funkybabes.nl/
...lots of output ...
bash: ROOOWAN/fotoboek: command not found


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



[jira] [Commented] (NUTCH-1097) application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml

2011-10-12 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13125916#comment-13125916
 ] 

Andrzej Bialecki  commented on NUTCH-1097:
--

+1, the latest patch looks good.

 application/xhtml+xml should be enabled in plugin.xml of parse-html; allow 
 multiple mimetypes for plugin.xml
 

 Key: NUTCH-1097
 URL: https://issues.apache.org/jira/browse/NUTCH-1097
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.3
Reporter: Ferdy
Priority: Minor
 Fix For: 1.4

 Attachments: NUTCH-1097-nutchgora_v1.patch, 
 NUTCH-1097-nutchgora_v2.patch, NUTCH-1097-v1.patch, NUTCH-1097-v2.patch, 
 NUTCH-1097-v3.patch, NUTCH-1097-v4.patch


 The configuration in parse-plugins.xml expects the parse-html plugin to 
 accept application/xhtml+xml, however the plugin.xml of this plugin does not 
 list this type. Either change the entry in parse-plugins.xml or change the 
 parse-html plugin.xml. I suggest the latter. See patch.





[jira] [Commented] (NUTCH-1142) Normalization and filtering in WebGraph

2011-10-12 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13125931#comment-13125931
 ] 

Andrzej Bialecki  commented on NUTCH-1142:
--

+1, the patch looks good.

(There is one philosophical :) aspect of this change, as with any situation 
where you calculate PageRank in presence of URL filtering: does it matter that 
a page was linked to from other pages that you decided to filter out? I.e. in 
Pagerank the relative page importance is a function of in-degree, and by 
filtering out incoming links you change the in-degree. This essentially means 
that you decide to ignore some evidence of a page being possibly more 
important, due to links from pages that may not be interesting to you but which 
still do exist. OTOH the incoming links may have been spam, so one would expect 
that in the grand picture it evens out.)
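The in-degree point can be made concrete with a toy sketch: filtering out a link's source page changes the score input for the target even when the target itself survives the filter. Plain Java, with illustrative names; this is not WebGraph code.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class InDegreeSketch {
    record Link(String from, String to) {}

    // In-degree per target url, counting only links whose source url
    // passes the filter.
    static Map<String, Long> inDegree(List<Link> links, Predicate<String> keep) {
        return links.stream()
                    .filter(l -> keep.test(l.from()))
                    .collect(Collectors.groupingBy(Link::to, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<Link> links = List.of(
            new Link("http://good.example/a", "http://target.example/"),
            new Link("http://spam.example/b", "http://target.example/"));
        // Unfiltered: in-degree 2; with spam.example filtered out: 1,
        // even though the target url itself was never filtered.
        System.out.println(inDegree(links, u -> true));
        System.out.println(inDegree(links, u -> !u.contains("spam")));
    }
}
```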

 Normalization and filtering in WebGraph
 ---

 Key: NUTCH-1142
 URL: https://issues.apache.org/jira/browse/NUTCH-1142
 Project: Nutch
  Issue Type: Improvement
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5

 Attachments: NUTCH-1142-1.4.patch, NUTCH-1142-1.5-2.patch, 
 NUTCH-1142-1.5-3.patch


 The WebGraph programs performs URL normalization. Since normalization of 
 outlinks is already performed during the parse it should become optional. 
 There is also no URL filtering mechanism in the web graph program. When a 
 CrawlDatum is removed from the CrawlDB by an URL filter is should be possible 
 to remove it from the web graph as well.





[jira] [Commented] (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?

2011-10-11 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13124737#comment-13124737
 ] 

Andrzej Bialecki  commented on NUTCH-797:
-

The fixup code in Tika is still a private method in HtmlParser, so in this case 
the upgrade to Tika 0.10 won't help, we still have to apply the above patch.

I'll commit this shortly.

 parse-tika is not properly constructing URLs when the target begins with a ?
 --

 Key: NUTCH-797
 URL: https://issues.apache.org/jira/browse/NUTCH-797
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.1
 Environment: Win 7, Java(TM) SE Runtime Environment (build 
 1.6.0_16-b01)
 Also repro's on RHEL and java 1.4.2
Reporter: Robert Hohman
Assignee: Andrzej Bialecki 
Priority: Minor
 Fix For: 1.4, nutchgora

 Attachments: pureQueryUrl-2.patch, pureQueryUrl.patch



[jira] [Commented] (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?

2011-10-11 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13125016#comment-13125016
 ] 

Andrzej Bialecki  commented on NUTCH-797:
-

Uhh, sorry - I'll fix this in a moment.

 parse-tika is not properly constructing URLs when the target begins with a ?
 --

 Key: NUTCH-797
 URL: https://issues.apache.org/jira/browse/NUTCH-797
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.1
 Environment: Win 7, Java(TM) SE Runtime Environment (build 
 1.6.0_16-b01)
 Also repro's on RHEL and java 1.4.2
Reporter: Robert Hohman
Assignee: Andrzej Bialecki 
Priority: Minor
 Fix For: nutchgora

 Attachments: pureQueryUrl-2.patch, pureQueryUrl.patch



[jira] [Commented] (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?

2011-10-11 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13125077#comment-13125077
 ] 

Andrzej Bialecki  commented on NUTCH-797:
-

I'm puzzled by the algorithm in fixEmbeddedParams (which was refactored into 
URLUtil), and I don't understand how it was ever supposed to work. If I enable 
this method then most of the test URLs in TestURLUtil fail, because they are 
not resolved according to the RFC.

In your example in NUTCH-1115, what was the expected result of resolving the 
base url http://www.funkybabes.nl/;ROOOWAN/fotoboek; and e.g. a target of 
forumregels ?

* http://www.funkybabes.nl/forumregels
* http://www.funkybabes.nl/;ROOOWAN/forumregels
* http://www.funkybabes.nl/forumregels;ROOOWAN
* none of the above ;)

 parse-tika is not properly constructing URLs when the target begins with a ?
 --

 Key: NUTCH-797
 URL: https://issues.apache.org/jira/browse/NUTCH-797
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.1
 Environment: Win 7, Java(TM) SE Runtime Environment (build 
 1.6.0_16-b01)
 Also repro's on RHEL and java 1.4.2
Reporter: Robert Hohman
Assignee: Andrzej Bialecki 
Priority: Minor
 Fix For: nutchgora

 Attachments: pureQueryUrl-2.patch, pureQueryUrl.patch


 This is my first bug and patch on nutch, so apologies if I have not provided 
 enough detail.
 In crawling the page at 
 http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0 there are 
 links in the page that look like this:
 <a href="?co=0&sk=0&p=2&pi=1">2</a></td><td><a href="?co=0&sk=0&p=3&pi=1">3</a>
 in org.apache.nutch.parse.tika.DOMContentUtils rev 916362 (trunk), as 
 getOutlinks looks for links, it comes across this link and constructs a new 
 url with a base URL class built from 
 "http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0", and a 
 target of "?co=0&sk=0&p=2&pi=1".
 The URL class, per RFC 3986 at 
 http://labs.apache.org/webarch/uri/rfc/rfc3986.html#relative-merge, defines 
 how to merge these two, and per the RFC, the URL class merges them to 
 http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&p=2&pi=1
 because the RFC explicitly states that the rightmost url segment (the 
 Search.aspx in this case) should be ripped off before combining.
 While this is compliant with the RFC, it means the URLs which are created for 
 the next round of fetching are incorrect.  Modern browsers seem to handle 
 this case (I checked IE8 and Firefox 3.5), so I'm guessing this is an obscure 
 exception or handling of what is a poorly formed url on accenture's part.
 I have fixed this by modifying DOMContentUtils to look for the case where a ? 
 begins the target, and then pulling the rightmost component out of the base 
 and inserting it into the target before the ?, so the target in this example 
 becomes:
 Search.aspx?co=0&sk=0&p=2&pi=1
 The URL class then properly constructs the new url as:
 http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0&p=2&pi=1
 If it is agreed that this solution works, I believe the other html parsers in 
 nutch would need to be modified in a similar way.
 Can I get feedback on this proposed solution?  Specifically I'm worried about 
 unforeseen side effects.
 Much thanks
 Here is the patch info:
 Index: src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
 ===================================================================
 --- src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (revision 916362)
 +++ src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (working copy)
 @@ -299,6 +299,50 @@
    return false;
  }

 +  private URL fixURL(URL base, String target) throws MalformedURLException
 +  {
 +   // handle params that are embedded into the base url - move them to target
 +   // so URL class constructs the new url class properly
 +   if (base.toString().indexOf(';') > 0)
 +  return fixEmbeddedParams(base, target);
 +
 +   // handle the case that there is a target that is a pure query.
 +   // Strictly speaking this is a violation of RFC 2396 section 5.2.2 on how to assemble
 +   // URLs but I've seen this in numerous places, for example at
 +   // http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0
 +   // It has urls in the page of the form href="?co=0&sk=0&pg=1", and by default
 +   // URL constructs the base+target combo as
 +   // http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&pg=1, incorrectly
 +   // dropping the Search.aspx target
 +   //
 +   // Browsers handle these just fine, they must have an exception similar

[jira] [Updated] (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?

2011-10-11 Thread Andrzej Bialecki (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-797:


Attachment: NUTCH-797.patch

Tentative patch, which changes the meaning of fixEmbeddedParams to 
removeEmbeddedParams.

 parse-tika is not properly constructing URLs when the target begins with a ?
 --

 Key: NUTCH-797
 URL: https://issues.apache.org/jira/browse/NUTCH-797
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.1
 Environment: Win 7, Java(TM) SE Runtime Environment (build 
 1.6.0_16-b01)
 Also repro's on RHEL and java 1.4.2
Reporter: Robert Hohman
Assignee: Andrzej Bialecki 
Priority: Minor
 Fix For: nutchgora

 Attachments: NUTCH-797.patch, pureQueryUrl-2.patch, pureQueryUrl.patch



[jira] [Commented] (NUTCH-1097) application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml

2011-10-11 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125414#comment-13125414
 ] 

Andrzej Bialecki  commented on NUTCH-1097:
--

+1, the idea makes sense. The patch looks good, but it needs a minor fix - mime 
types may also contain "." characters, e.g. application/vnd.ms-word, and 
these need to be escaped too.
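A small illustration of why an unescaped "." matters when a mime type is used inside a regex (Pattern.quote shown as one way to escape the literal; this is not the actual patch code):

```java
import java.util.regex.Pattern;

public class MimeTypeEscape {
    public static void main(String[] args) {
        String mime = "application/vnd.ms-word";
        // Used naively as a regex, '.' matches any character, so the
        // pattern is too loose:
        System.out.println("application/vndXms-word".matches(mime)); // true (wrong)
        // Quoting the mime type as a literal closes the hole:
        System.out.println("application/vndXms-word".matches(Pattern.quote(mime))); // false
        System.out.println(mime.matches(Pattern.quote(mime))); // true
    }
}
```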

 application/xhtml+xml should be enabled in plugin.xml of parse-html; allow 
 multiple mimetypes for plugin.xml
 

 Key: NUTCH-1097
 URL: https://issues.apache.org/jira/browse/NUTCH-1097
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.3
Reporter: Ferdy
Priority: Minor
 Fix For: 1.4

 Attachments: NUTCH-1097-nutchgora_v1.patch, NUTCH-1097-v1.patch, 
 NUTCH-1097-v2.patch, NUTCH-1097-v3.patch


 The configuration in parse-plugins.xml expects the parse-html plugin to 
 accept application/xhtml+xml, however the plugin.xml of this plugin does not 
 list this type. Either change the entry in parse-plugins.xml or change the 
 parse-html plugin.xml. I suggest the latter. See patch.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1154) Upgrade to Tika 0.10

2011-10-10 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13124428#comment-13124428
 ] 

Andrzej Bialecki  commented on NUTCH-1154:
--

TIKA-748 has been fixed and is scheduled to be included in Tika 1.0. If there 
are no objections I'd like to commit Tika 0.10, put a comment in CHANGES.txt, 
and disable this part of the test until we upgrade to Tika 1.0.

 Upgrade to Tika 0.10
 

 Key: NUTCH-1154
 URL: https://issues.apache.org/jira/browse/NUTCH-1154
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.4
Reporter: Andrzej Bialecki 
 Attachments: NUTCH-1154.diff


 There have been significant improvements in Tika 0.10 and it would be nice to 
 use the latest Tika in 1.4.





[jira] [Created] (NUTCH-1152) Upgrade to SolrJ 3.4.0

2011-10-07 Thread Andrzej Bialecki (Created) (JIRA)
Upgrade to SolrJ 3.4.0
--

 Key: NUTCH-1152
 URL: https://issues.apache.org/jira/browse/NUTCH-1152
 Project: Nutch
  Issue Type: Improvement
Reporter: Andrzej Bialecki 
 Fix For: 1.4


Current release of Lucene/Solr is 3.4.0, but we're still using 3.1.0. The fix 
is trivial, just replace 3.1.0 with 3.4.0 in ivy.xml. If there are no 
objections I'll make the change shortly.
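The change amounts to bumping the revision in ivy.xml; an illustrative fragment (the exact attributes and conf mapping in Nutch's ivy.xml may differ):

```xml
<!-- illustrative ivy.xml dependency line; conf mapping may differ in Nutch -->
<dependency org="org.apache.solr" name="solr-solrj" rev="3.4.0" conf="*->default"/>
```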





[jira] [Resolved] (NUTCH-1152) Upgrade to SolrJ 3.4.0

2011-10-07 Thread Andrzej Bialecki (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-1152.
--

Resolution: Fixed
  Assignee: Andrzej Bialecki 

Committed in rev. 1180087. This commit also upgrades SLF4J as a dependency of 
SolrJ, to release 1.6.1.

 Upgrade to SolrJ 3.4.0
 --

 Key: NUTCH-1152
 URL: https://issues.apache.org/jira/browse/NUTCH-1152
 Project: Nutch
  Issue Type: Improvement
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 1.4


 Current release of Lucene/Solr is 3.4.0, but we're still using 3.1.0. The fix 
 is trivial, just replace 3.1.0 with 3.4.0 in ivy.xml. If there are no 
 objections I'll make the change shortly.





[jira] [Created] (NUTCH-1154) Upgrade to Tika 0.10

2011-10-07 Thread Andrzej Bialecki (Created) (JIRA)
Upgrade to Tika 0.10


 Key: NUTCH-1154
 URL: https://issues.apache.org/jira/browse/NUTCH-1154
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.4
Reporter: Andrzej Bialecki 


There have been significant improvements in Tika 0.10 and it would be nice to 
use the latest Tika in 1.4.





[jira] [Updated] (NUTCH-1154) Upgrade to Tika 0.10

2011-10-07 Thread Andrzej Bialecki (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-1154:
-

Attachment: NUTCH-1154.diff

Patch to upgrade to Tika 0.10. Unfortunately, TestRTFParser fails with this 
version of Tika - the extracted body of the text is empty. See TIKA-748. Still, 
I think the improvements in PDF and Office parsers are worth the upgrade.

 Upgrade to Tika 0.10
 

 Key: NUTCH-1154
 URL: https://issues.apache.org/jira/browse/NUTCH-1154
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.4
Reporter: Andrzej Bialecki 
 Attachments: NUTCH-1154.diff


 There have been significant improvements in Tika 0.10 and it would be nice to 
 use the latest Tika in 1.4.





[jira] [Commented] (NUTCH-1124) JUnit test for scoring-opic

2011-10-05 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13120982#comment-13120982
 ] 

Andrzej Bialecki  commented on NUTCH-1124:
--

Our implementation is most definitely inaccurate (broken?), though I'm not sure 
if the original OPIC algorithm is better.

The original OPIC paper explains that each node needs to give away all its 
cash, and then receive cash from other nodes, but in their experiments this led 
to a yo-yo instability of large amounts of cash floating in and out, in 
response to changes in the graph and the fact that there is a delay of a full 
re-crawl cycle, i.e. all known urls need to be re-crawled in order to collect 
and redistribute all cash that is potentially floating in the graph. In order 
to dampen this effect they added buffering - a history of the latest N scores, 
and they would consider an average of these scores. This resulted in smoothing 
and dampening of changes, but it's an artificial hack that is sensitive to the 
dynamics of changes in the webgraph and the speed of re-crawl.

Our implementation of OPIC doesn't give away cash at all, instead it duplicates 
it and then distributes, which causes the total amount of cash floating in a 
webgraph to double in each cycle even when a graph is static. We could fix this 
by giving away all cash and then introducing a mechanism to collect all cash 
from dangling nodes (without outlinks) to redistribute it evenly to all nodes. 
This would bring us closer to the original OPIC without smoothing. Still, I 
expect the same instability would occur, especially in the face of a changing 
graph.
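A toy sketch of the difference (not Nutch code): on a static 3-node ring, the duplicate-and-distribute step described above doubles the total cash each cycle, while the give-away step conserves it.

```java
import java.util.Arrays;

public class OpicCashSketch {
    // Toy 3-node ring graph: node i links only to node i+1.
    // Nutch-style step: a node keeps its cash AND sends a full copy to its
    // outlink, so total cash doubles every cycle.
    static double[] nutchStep(double[] cash) {
        double[] next = cash.clone();                 // nodes keep their cash
        for (int i = 0; i < cash.length; i++)
            next[(i + 1) % cash.length] += cash[i];   // ...and give a copy away
        return next;
    }

    // Original-OPIC-style step: a node gives away ALL its cash; total is conserved.
    static double[] opicStep(double[] cash) {
        double[] next = new double[cash.length];
        for (int i = 0; i < cash.length; i++)
            next[(i + 1) % cash.length] += cash[i];
        return next;
    }

    static double total(double[] c) { return Arrays.stream(c).sum(); }

    public static void main(String[] args) {
        double[] a = {1, 1, 1}, b = {1, 1, 1};
        for (int cycle = 0; cycle < 3; cycle++) { a = nutchStep(a); b = opicStep(b); }
        System.out.println(total(a)); // 24.0 - doubled every cycle
        System.out.println(total(b)); // 3.0  - conserved
    }
}
```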

 JUnit test for scoring-opic
 ---

 Key: NUTCH-1124
 URL: https://issues.apache.org/jira/browse/NUTCH-1124
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: 1.4
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: 1.5


 This issue is part of the larger attempt to provide a Junit test case for 
 every Nutch plugin.





Re: [VOTE] Move 2.0 out of trunk

2011-09-20 Thread Andrzej Bialecki

On 18/09/2011 02:21, Julien Nioche wrote:

Hi,

Following the discussions [1] on the dev-list about the future of Nutch
2.0, I would like to call for a vote on moving Nutch 2.0 from the trunk
to a separate branch, promote 1.4 to trunk and consider 2.0 as
unmaintained. The arguments for / against can be found in the thread I
mentioned.

The vote is open for the next 72 hours.

[ ] +1 : Shelve 2.0 and move 1.4 to trunk
[ ]  0 : No opinion
[ ] -1 : Bad idea.  Please give justification.


+1 - at this time it's clear that 2.0 didn't pan out as we expected; we 
should restart from the 1.x codebase for a usable platform, and continue the 
redesign from there.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



[jira] [Commented] (NUTCH-1087) Deprecate crawl command and replace with example script

2011-08-23 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089405#comment-13089405
 ] 

Andrzej Bialecki  commented on NUTCH-1087:
--


IIRC we had this discussion in the past... It's true that we already rely on 
Bash to do anything useful, no matter whether it's on Windows or on a *nix-like 
OS. And it's true that the crawl command has been a constant source of 
confusion over the years. The crawl application also suffered from some subtle 
bugs, especially when running in local mode (e.g. the PluginRepository leaks).

But the argument about maintenance costs is IMHO moot - you have to maintain a 
shell script, too, so it's no different from maintaining a Java class. Where it 
differs, I think, is that moving the crawl cycle logic to a shell script now 
raises the bar for Java developers who are not familiar with Bash scripting - a 
robust crawl script is not easy to follow, as it needs to handle error 
conditions and manage input/output resources on HDFS. On the other hand it's 
easier for system admins to tweak a script than to tweak Java code... so I 
guess it's also a question of who the audience for this functionality is.

I'm +0 for removing Crawl and replacing it with a script, IMHO it doesn't 
change the picture in any significant way.


 Deprecate crawl command and replace with example script
 ---

 Key: NUTCH-1087
 URL: https://issues.apache.org/jira/browse/NUTCH-1087
 Project: Nutch
  Issue Type: Task
Affects Versions: 1.4
Reporter: Markus Jelsma
Priority: Minor
 Fix For: 1.4


 * remove the crawl command
 * add basic crawl shell script
 See thread:
 http://www.mail-archive.com/dev@nutch.apache.org/msg03848.html





[jira] [Commented] (NUTCH-1014) Migrate from Apache ORO to java.util.regex

2011-07-19 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067972#comment-13067972
 ] 

Andrzej Bialecki  commented on NUTCH-1014:
--

java.util.regex has the advantage of being a part of the JRE. However, it is 
quite slow for more complex regexes. See e.g. this benchmark: 
http://www.tusker.org/regex/regex_benchmark.html . In my experience with larger 
crawls this is especially important when using regexes for URL filtering and 
normalization - an innocent-looking regex can melt the cpu when processing a 
64kB long junk URL, and consequently it can stall the crawl... In such cases 
it's good to have an option to fall back to a subset of regex features and use 
a DFA-based library like e.g. Brics. ORO is generally faster than j.u.regex 
(but also it isn't maintained anymore). Brics lacks support for many operators, 
but it's fast. Perhaps ICU4j would be a good alternative - it's fully 
JDK-compatible and offers good performance.
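A quick demonstration of the backtracking blow-up with java.util.regex (the input is deliberately short here; lengthen the run of 'a's and the time grows exponentially, which is the "melt the cpu" failure mode on long junk URLs):

```java
import java.util.regex.Pattern;

public class RegexBacktracking {
    public static void main(String[] args) {
        // Nested quantifiers force exponential backtracking on a near-miss
        // input (no terminating 'b'). 20 chars is ~2^20 states and still
        // completes; a 64kB junk URL against a similar pattern can stall
        // a fetcher thread.
        Pattern p = Pattern.compile("(a+)+b");
        String nearMiss = "aaaaaaaaaaaaaaaaaaaa" + "x"; // 20 'a's, then 'x'
        long t0 = System.nanoTime();
        boolean matched = p.matcher(nearMiss).matches();
        long micros = (System.nanoTime() - t0) / 1_000;
        System.out.println("matched=" + matched + " in ~" + micros + " us");
    }
}
```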

 Migrate from Apache ORO to java.util.regex
 --

 Key: NUTCH-1014
 URL: https://issues.apache.org/jira/browse/NUTCH-1014
 Project: Nutch
  Issue Type: Improvement
Reporter: Markus Jelsma
 Fix For: 1.4, 2.0


 A separate issue tracking migration of all components from Apache ORO to 
 java.util.regex. Components involved are:
 - RegexURLNormalzier
 - OutlinkExtractor
 - JSParseFilter
 - MoreIndexingFilter
 - BasicURLNormalizer





[jira] [Commented] (NUTCH-985) MoreIndexingFilter doesn't use properly formatted date fields for Solr

2011-05-17 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034724#comment-13034724
 ] 

Andrzej Bialecki  commented on NUTCH-985:
-

We should use Solr's DateUtil in all such places, to avoid code duplication 
and confusion should the date format ever change... The patch does essentially 
the same as DateUtil, except that DateUtil reuses SimpleDateFormat instances 
in a thread-safe way, so it's more efficient.
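The reuse pattern in question, sketched (an illustration of thread-safe SimpleDateFormat reuse, not DateUtil's actual code; the format is Solr's ISO-8601 UTC form):

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class SolrDateFormat {
    // SimpleDateFormat is not thread-safe, so instead of creating a new
    // instance per call, keep one per thread and reuse it.
    private static final ThreadLocal<SimpleDateFormat> FMT =
        ThreadLocal.withInitial(() -> {
            SimpleDateFormat f = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
            f.setTimeZone(TimeZone.getTimeZone("UTC"));
            return f;
        });

    public static String format(Date d) {
        return FMT.get().format(d);
    }

    public static void main(String[] args) {
        System.out.println(format(new Date(0L))); // 1970-01-01T00:00:00Z
    }
}
```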

 MoreIndexingFilter doesn't use properly formatted date fields for Solr
 --

 Key: NUTCH-985
 URL: https://issues.apache.org/jira/browse/NUTCH-985
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.3, 2.0
Reporter: Dietrich Schmidt
Assignee: Markus Jelsma
 Fix For: 1.3, 2.0

 Attachments: NUTCH-985-trunk-1.patch, NUTCH-985.1.3-1.patch, 
 indexlastmodifieddate.jar


 I am using the index-more plugin to parse the lastModified data in web
 pages in order to store it in a Solr data field.
 In solrindex-mapping.xml I am mapping lastModified to a field changed in 
 Solr:
 <field dest="changed" source="lastModified"/>
 However, when posting data to Solr the SolrIndexer posts it as a long,
 not as a date:
 <add><doc boost="1.0"><field
 name="changed">107932680</field><field
 name="tstamp">20110414144140188</field><field
 name="date">20040315</field>
 Solr rejects the data because of the improper data type.



Re: Differences 1.x and trunk

2011-03-18 Thread Andrzej Bialecki

On 3/18/11 4:31 PM, Markus Jelsma wrote:

Hi all,

I'm giving it a try to patch https://issues.apache.org/jira/browse/NUTCH-963
to trunk after committing to 1.3. There are of course a lot of differences so
I need a little advice on how to proceed:

- instead of using CrawlDB and CrawlDatum we now need WebTableReader?


Actually you need to use StorageUtils to set up Mapper or Reducer 
contexts. See other tools, e.g. Fetcher or Generator.



- trunk uses slf instead of commons logging now?


Yes.


- a page is now represented by storage.WebPage?


Yes. When you prepare a Job you also need to specify what fields from 
WebPage you are interested in (and only these fields will be pulled in 
from the storage). This is all handled by StorageUtils methods.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [jira] Closed: (NUTCH-951) Backport changes from 2.0 into 1.3

2011-03-10 Thread Andrzej Bialecki

On 3/10/11 10:57 PM, Julien Nioche (JIRA) wrote:


  [ 
https://issues.apache.org/jira/browse/NUTCH-951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche closed NUTCH-951.
---


NUTCH-825 committed in revision 1080368
All the known improvements from 2.0 have been backported into 1.3 now



The only remaining issue to address before rolling out a 1.3 release is 
NUTCH-914 Implement Apache Project Branding Requirements (and subtasks...)



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



[jira] Resolved: (NUTCH-951) Backport changes from 2.0 into 1.3

2011-03-09 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-951.
-

Resolution: Fixed

 Backport changes from 2.0 into 1.3
 --

 Key: NUTCH-951
 URL: https://issues.apache.org/jira/browse/NUTCH-951
 Project: Nutch
  Issue Type: Task
Affects Versions: 1.3
Reporter: Julien Nioche
Assignee: Andrzej Bialecki 
Priority: Blocker
 Fix For: 1.3


 I've compared the changes from 2.0 with 1.3 and found the following 
 differences (excluding anything specific to 2.0/GORA)
 *  NUTCH-564 External parser supports encoding attribute (Antony 
 Bowesman, mattmann)
 *  NUTCH-714 Need a SFTP and SCP Protocol Handler (Sanjoy Ghosh, mattmann)
 *  NUTCH-825 Publish nutch artifacts to central maven repository 
 (mattmann)
 *  NUTCH-851 Port logging to slf4j (jnioche)
 *  NUTCH-861 Renamed HTMLParseFilter into ParseFilter
 *  NUTCH-872 Change the default fetcher.parse to FALSE (ab).
 *  NUTCH-876 Remove remaining robots/IP blocking code in lib-http (ab)
 *  NUTCH-880 REST API for Nutch (ab)
 *  NUTCH-883 Remove unused parameters from nutch-default.xml (jnioche)
 *  NUTCH-884 FetcherJob should run more reduce tasks than default (ab)
 *  NUTCH-886 A .gitignore file for Nutch (dogacan)
 *  NUTCH-894 Move statistical language identification from indexing to 
 parsing step
 *  NUTCH-921 Reduce dependency of Nutch on config files (ab)
 *  NUTCH-930 Remove remaining dependencies on Lucene API (ab)
 *  NUTCH-931 Simple admin API to fetch status and stop the service (ab)
 *  NUTCH-932 Bulk REST API to retrieve crawl results as JSON (ab)
 Let's go through this and decide what to port to 1.3



[jira] Commented: (NUTCH-951) Backport changes from 2.0 into 1.3

2011-03-09 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13004488#comment-13004488
 ] 

Andrzej Bialecki  commented on NUTCH-951:
-

* Ported NUTCH-872 in rev. 1079746.
* Ported NUTCH-876 in rev. 1079753.
* Ported NUTCH-921 in rev. 1079760.
* NUTCH-884 is not applicable to 1.3 because here fetching executes in map 
tasks, so there's a correct number of them already.

 Backport changes from 2.0 into 1.3
 --

 Key: NUTCH-951
 URL: https://issues.apache.org/jira/browse/NUTCH-951
 Project: Nutch
  Issue Type: Task
Affects Versions: 1.3
Reporter: Julien Nioche
Assignee: Andrzej Bialecki 
Priority: Blocker
 Fix For: 1.3


 I've compared the changes from 2.0 with 1.3 and found the following 
 differences (excluding anything specific to 2.0/GORA)
 *  NUTCH-564 External parser supports encoding attribute (Antony 
 Bowesman, mattmann)
 *  NUTCH-714 Need a SFTP and SCP Protocol Handler (Sanjoy Ghosh, mattmann)
 *  NUTCH-825 Publish nutch artifacts to central maven repository 
 (mattmann)
 *  NUTCH-851 Port logging to slf4j (jnioche)
 *  NUTCH-861 Renamed HTMLParseFilter into ParseFilter
 *  NUTCH-872 Change the default fetcher.parse to FALSE (ab).
 *  NUTCH-876 Remove remaining robots/IP blocking code in lib-http (ab)
 *  NUTCH-880 REST API for Nutch (ab)
 *  NUTCH-883 Remove unused parameters from nutch-default.xml (jnioche)
 *  NUTCH-884 FetcherJob should run more reduce tasks than default (ab)
 *  NUTCH-886 A .gitignore file for Nutch (dogacan)
 *  NUTCH-894 Move statistical language identification from indexing to 
 parsing step
 *  NUTCH-921 Reduce dependency of Nutch on config files (ab)
 *  NUTCH-930 Remove remaining dependencies on Lucene API (ab)
 *  NUTCH-931 Simple admin API to fetch status and stop the service (ab)
 *  NUTCH-932 Bulk REST API to retrieve crawl results as JSON (ab)
 Let's go through this and decide what to port to 1.3



[jira] Resolved: (NUTCH-962) max. redirects not handled correctly: fetcher stops at max-1 redirects

2011-03-09 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-962.
-

   Resolution: Fixed
Fix Version/s: 2.0
   1.3
 Assignee: Andrzej Bialecki 

 max. redirects not handled correctly: fetcher stops at max-1 redirects
 --

 Key: NUTCH-962
 URL: https://issues.apache.org/jira/browse/NUTCH-962
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.2, 1.3, 2.0
Reporter: Sebastian Nagel
Assignee: Andrzej Bialecki 
 Fix For: 1.3, 2.0

 Attachments: Fetcher_redir.patch


 The fetcher stops following redirects one redirect before the max. redirects 
 is reached.
 The description of http.redirect.max
  The maximum number of redirects the fetcher will follow when
  trying to fetch a page. If set to negative or 0, fetcher won't immediately
  follow redirected URLs, instead it will record them for later fetching.
 suggests that if set to 1 that one redirect will be followed.
 I tried to crawl two documents, the first redirecting by
  <meta http-equiv="refresh" content="0; URL=./to/meta_refresh_target.html">
 to the second, with http.redirect.max = 1.
 The second document is not fetched and the URL has state GONE in CrawlDb.
 fetching file:/test/redirects/meta_refresh.html
 redirectCount=0
 -finishing thread FetcherThread, activeThreads=1
  - content redirect to file:/test/redirects/to/meta_refresh_target.html 
 (fetching now)
  - redirect count exceeded file:/test/redirects/to/meta_refresh_target.html
 The attached patch would fix this: if http.redirect.max is 1 : one redirect 
 is followed.
 Of course, this would mean there is no possibility to skip redirects at all 
 since 0
 (as well as negative values) means treat redirects as ordinary links.
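The off-by-one comes down to the guard on the redirect counter; schematically (this is an illustration of the described behavior, not the actual Fetcher code):

```java
public class RedirectGuard {
    // Returns how many redirects in an (unbounded) chain get followed
    // before the guard fires, given http.redirect.max = max.
    static int followed(int max, boolean buggy) {
        int count = 0;
        while (true) {
            // the buggy guard trips one redirect too early
            if (buggy ? count >= max - 1 : count >= max) break;
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(followed(1, true));  // 0 - "redirect count exceeded" at once
        System.out.println(followed(1, false)); // 1 - one redirect followed, as documented
    }
}
```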



[jira] Resolved: (NUTCH-955) Ivy configuration

2011-03-09 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-955.
-

   Resolution: Fixed
Fix Version/s: 2.0
 Assignee: Andrzej Bialecki 

 Ivy configuration
 -

 Key: NUTCH-955
 URL: https://issues.apache.org/jira/browse/NUTCH-955
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: 2.0
Reporter: Alexis
Assignee: Andrzej Bialecki 
 Fix For: 2.0

 Attachments: ivy.patch


 As mentioned in NUTCH-950, we can slightly improve the Ivy configuration to 
 help setup the Gora backend more easily.
 If the user does not want to stick with default HSQL database, other 
 alternatives exist, such as MySQL and HBase.
 org.restlet and xercesImpl versions should be changed as well.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


Gora/HBase dependencies and deploy artifacts

2011-02-24 Thread Andrzej Bialecki

Hi all,

Recently I've been deploying Nutch trunk to an already existing Hadoop 
cluster. And immediately I hit a snag.


Nutch was configured to use gora-hbase. The nutch.job jar doesn't 
include gora-hbase even if it was configured in nutch-site.xml. 
Furthermore, gora-hbase depends on HBase and its dependencies, which 
need to be found on classpath.


Typically for development and testing I solved this issue by deploying 
gora-core and gora-hbase + all hbase libs to hadoop/lib across the 
cluster. This is a bit dirty - Hadoop clusters should be seen as a 
generic computing fabric, so they should be application-agnostic, 
besides, this creates maintenance & ops issues.


We could put all these libs in lib/ inside nutch.job, so that they are 
unpacked and put on classpath during task setup. This would work fine 
for Mapper/Reducer. HOWEVER... I saw in some versions of Hadoop that 
InputFormat / OutputFormat classes were initialized prior to this 
unpacking - and in our case these depend on the libs in as-yet-unpacked 
job jar... e.g. GoraInputFormat. (I'm not 100% sure that's the case in 
Hadoop 0.20.2, so this is something that needs to be tested).


Furthermore, even if we packed the jars in lib/ inside nutch.job, still 
many tools wouldn't work, because they depend on classes from those libs 
during the local execution (before the job is sent to task trackers), 
and the URLClassLoader can't load classes from jars within jars... A 
workaround for this would be to take all those jars and re-pack them 
together under / directory in nutch.job. This would satisfy the 
dependencies for local execution, and for Mapper/Reducer execution but 
I'm not sure if it solves the problem of Input/OutputFormat-s that I 
mentioned above.


In short, we need a clear working procedure how to deploy Gora backend 
implementations so that they work with Nutch and with a generic 
unmodified Hadoop cluster.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



[jira] Resolved: (NUTCH-939) Added -dir command line option to Indexer and SolrIndexer, allowing to specify directory containing segments

2010-12-21 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-939.
-

Resolution: Fixed
  Assignee: Andrzej Bialecki 

I modified the patch slightly to allow more flexibility (you can mix individual 
segment names and the -dir options) as well as allowing segments placed on 
different filesystems. Committed in rev. 1051505. Thank you!

 Added -dir command line option to Indexer and SolrIndexer,  allowing to 
 specify directory containing segments
 -

 Key: NUTCH-939
 URL: https://issues.apache.org/jira/browse/NUTCH-939
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.3
Reporter: Claudio Martella
Assignee: Andrzej Bialecki 
Priority: Minor
 Fix For: 1.3

 Attachments: Indexer.patch, SolrIndexer.patch


 The patches add -dir option, so the user can specify the directory in which 
 the segments are to be found. The actual mode is to specify the list of 
 segments, which is not very easy with hdfs. Also, the -dir option is already 
 implemented in LinkDB and SegmentMerger, for example.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-948) Remove Lucene dependencies

2010-12-21 Thread Andrzej Bialecki (JIRA)
Remove Lucene dependencies
--

 Key: NUTCH-948
 URL: https://issues.apache.org/jira/browse/NUTCH-948
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.3
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 1.3


Branch-1.3 still has Lucene libs, but uses Lucene only in one place, namely it 
uses DateTools in index-basic. DateTools should be replaced with Solr's 
DateUtil, as we did in trunk, and then we can remove Lucene libs as a 
dependency.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-948) Remove Lucene dependencies

2010-12-21 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-948.
-

Resolution: Fixed

Committed in rev. 1051509.

 Remove Lucene dependencies
 --

 Key: NUTCH-948
 URL: https://issues.apache.org/jira/browse/NUTCH-948
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.3
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 1.3


 Branch-1.3 still has Lucene libs, but uses Lucene only in one place, namely 
 it uses DateTools in index-basic. DateTools should be replaced with Solr's 
 DateUtil, as we did in trunk, and then we can remove Lucene libs as a 
 dependency.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-939) Added -dir command line option to Indexer and SolrIndexer, allowing to specify directory containing segments

2010-12-21 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12973915#action_12973915
 ] 

Andrzej Bialecki  commented on NUTCH-939:
-

1.2 release is out, and branch-1.2 is unlikely to result in a subsequent 
release - most users seem to be interested either in 1.3 or trunk.

 Added -dir command line option to Indexer and SolrIndexer,  allowing to 
 specify directory containing segments
 -

 Key: NUTCH-939
 URL: https://issues.apache.org/jira/browse/NUTCH-939
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.3
Reporter: Claudio Martella
Assignee: Andrzej Bialecki 
Priority: Minor
 Fix For: 1.3

 Attachments: Indexer.patch, SolrIndexer.patch


 The patches add -dir option, so the user can specify the directory in which 
 the segments are to be found. The actual mode is to specify the list of 
 segments, which is not very easy with hdfs. Also, the -dir option is already 
 implemented in LinkDB and SegmentMerger, for example.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Does Nutch 2.0 in good enough shape to test?

2010-12-17 Thread Andrzej Bialecki

(switching to devs)

On 12/17/10 10:18 AM, Alexis wrote:

Hi,

I've spent some time working on this as well. I've just put together a
blog entry addressing the issues I ran into. See
http://techvineyard.blogspot.com/2010/12/build-nutch-20.html

In a nutshell, I changed three pieces in Gora and Nutch code:
- flush the datastore regularly in the Hadoop RecordWriter (in GoraOutputFormat)


Careful here. DataStore flush may be very expensive, so it should be 
done only when we are finished with the output. If you see that data is 
lost without this flush then this should be reported as a Gora bug.



- wait for Hadoop job completion in the Fetcher job


I missed your previous email... I'll fix this shortly - thanks for 
spotting it.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Java.io.IOException with multiple <copyField/> directives

2010-12-03 Thread Andrzej Bialecki
On 2010-12-03 09:52, Peter Litsegård wrote:
 Hi!
 
 I've run into a strange behaviour while using Nutch (solrindexer) together 
 with Solr 1.4.1. I'd like to copy the 'title' and 'content' field to another 
 field, say, 'foo'. In my first attempt I added the <copyField/> directives in 
 schema.xml and got the java exception so I removed them from schema.xml. In 
 my second attempt I added the <copyField/> directives to the 
 'solrindex-mapping.xml' file and ran into the same exception again! Is this a 
 known issue or have I stumbled into unknown territory?
 
 Any workarounds?

I suspect that the field type declared in your schema.xml is not
multiValued. What was the exception?
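
For reference, a copy target that accepts several sources must be declared multiValued in schema.xml. A minimal sketch (field and type names here are illustrative, not taken from the original report):

```xml
<!-- Target field must be multiValued when more than one source copies into it -->
<field name="foo" type="text" indexed="true" stored="true" multiValued="true"/>

<copyField source="title" dest="foo"/>
<copyField source="content" dest="foo"/>
```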


-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



[jira] Commented: (NUTCH-939) Added -dir command line option to Indexer and SolrIndexer, allowing to specify directory containing segments

2010-11-26 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12936047#action_12936047
 ] 

Andrzej Bialecki  commented on NUTCH-939:
-

Please note that trunk uses a very different method of working with segments 
(called batches there), and -dir is not applicable there.

 Added -dir command line option to Indexer and SolrIndexer,  allowing to 
 specify directory containing segments
 -

 Key: NUTCH-939
 URL: https://issues.apache.org/jira/browse/NUTCH-939
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.2
Reporter: Claudio Martella
Priority: Minor
 Fix For: 1.2

 Attachments: Indexer.patch, SolrIndexer.patch


 The patches add -dir option, so the user can specify the directory in which 
 the segments are to be found. The actual mode is to specify the list of 
 segments, which is not very easy with hdfs. Also, the -dir option is already 
 implemented in LinkDB and SegmentMerger, for example.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON

2010-11-25 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-932:


Attachment: NUTCH-932-4.patch

Final version of the patch.

 Bulk REST API to retrieve crawl results as JSON
 ---

 Key: NUTCH-932
 URL: https://issues.apache.org/jira/browse/NUTCH-932
 Project: Nutch
  Issue Type: New Feature
  Components: REST_api
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: db.formatted.gz, NUTCH-932-2.patch, NUTCH-932-3.patch, 
 NUTCH-932-4.patch, NUTCH-932.patch, NUTCH-932.patch, NUTCH-932.patch


 It would be useful to be able to retrieve results of a crawl as JSON. There 
 are a few things that need to be discussed:
 * how to return bulk results using Restlet (WritableRepresentation subclass?)
 * what should be the format of results?
 I think it would make sense to provide a single record retrieval (by primary 
 key), all records, and records within a range. This incidentally matches well 
 the capabilities of the Gora Query class :)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON

2010-11-25 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-932.
-

   Resolution: Fixed
Fix Version/s: 2.0

Committed in rev. 1039014.

 Bulk REST API to retrieve crawl results as JSON
 ---

 Key: NUTCH-932
 URL: https://issues.apache.org/jira/browse/NUTCH-932
 Project: Nutch
  Issue Type: New Feature
  Components: REST_api
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 2.0

 Attachments: db.formatted.gz, NUTCH-932-2.patch, NUTCH-932-3.patch, 
 NUTCH-932-4.patch, NUTCH-932.patch, NUTCH-932.patch, NUTCH-932.patch


 It would be useful to be able to retrieve results of a crawl as JSON. There 
 are a few things that need to be discussed:
 * how to return bulk results using Restlet (WritableRepresentation subclass?)
 * what should be the format of results?
 I think it would make sense to provide a single record retrieval (by primary 
 key), all records, and records within a range. This incidentally matches well 
 the capabilities of the Gora Query class :)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON

2010-11-12 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-932:


Attachment: NUTCH-932-3.patch

NutchTool is an abstract class in this patch. This actually minimizes the 
amount of code throughout, though paradoxically the patch file is larger than 
before...

 Bulk REST API to retrieve crawl results as JSON
 ---

 Key: NUTCH-932
 URL: https://issues.apache.org/jira/browse/NUTCH-932
 Project: Nutch
  Issue Type: New Feature
  Components: REST_api
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: db.formatted.gz, NUTCH-932-2.patch, NUTCH-932-3.patch, 
 NUTCH-932.patch, NUTCH-932.patch, NUTCH-932.patch


 It would be useful to be able to retrieve results of a crawl as JSON. There 
 are a few things that need to be discussed:
 * how to return bulk results using Restlet (WritableRepresentation subclass?)
 * what should be the format of results?
 I think it would make sense to provide a single record retrieval (by primary 
 key), all records, and records within a range. This incidentally matches well 
 the capabilities of the Gora Query class :)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-880) REST API for Nutch

2010-11-05 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12928909#action_12928909
 ] 

Andrzej Bialecki  commented on NUTCH-880:
-

Thanks - this issue is already fixed in NUTCH-932, to be committed soon.

 REST API for Nutch
 --

 Key: NUTCH-880
 URL: https://issues.apache.org/jira/browse/NUTCH-880
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 2.0

 Attachments: API-2.patch, API.patch


 This issue is for discussing a REST-style API for accessing Nutch.
 Here's an initial idea:
 * I propose to use org.restlet for handling requests and returning 
 JSON/XML/whatever responses.
 * hook up all regular tools so that they can be driven via this API. This 
 would have to be an async API, since all Nutch operations take long time to 
 execute. It follows then that we need to be able also to list running 
 operations, retrieve their current status, and possibly 
 abort/cancel/stop/suspend/resume/...? This also means that we would have to 
 potentially create & manage many threads in a servlet - AFAIK this is frowned 
 upon by J2EE purists...
 * package this in a webapp (that includes all deps, essentially nutch.job 
 content), with the restlet servlet as an entry point.
 Open issues:
 * how to implement the reading of crawl results via this API
 * should we manage only crawls that use a single configuration per webapp, or 
 should we have a notion of crawl contexts (sets of crawl configs) with CRUD 
 ops on them? this would be nice, because it would allow managing of several 
 different crawls, with different configs, in a single webapp - but it 
 complicates the implementation a lot.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON

2010-11-04 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-932:


Attachment: NUTCH-932.patch

This patch adds bulk retrieval of crawl results. This is still very rough, e.g. 
there's no way to select crawlId or limit the fields... but it returns proper 
JSON.

This patch also includes other enhancements and bugfixes - with this patch I 
was able to perform a complete crawl cycle via REST.

 Bulk REST API to retrieve crawl results as JSON
 ---

 Key: NUTCH-932
 URL: https://issues.apache.org/jira/browse/NUTCH-932
 Project: Nutch
  Issue Type: New Feature
  Components: REST_api
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: NUTCH-932.patch


 It would be useful to be able to retrieve results of a crawl as JSON. There 
 are a few things that need to be discussed:
 * how to return bulk results using Restlet (WritableRepresentation subclass?)
 * what should be the format of results?
 I think it would make sense to provide a single record retrieval (by primary 
 key), all records, and records within a range. This incidentally matches well 
 the capabilities of the Gora Query class :)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON

2010-11-04 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-932:


Attachment: db.formatted.gz

Example DB content (this was passed through a JSON pretty-printer, otherwise 
it's just one giant line...).

 Bulk REST API to retrieve crawl results as JSON
 ---

 Key: NUTCH-932
 URL: https://issues.apache.org/jira/browse/NUTCH-932
 Project: Nutch
  Issue Type: New Feature
  Components: REST_api
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: db.formatted.gz, NUTCH-932.patch


 It would be useful to be able to retrieve results of a crawl as JSON. There 
 are a few things that need to be discussed:
 * how to return bulk results using Restlet (WritableRepresentation subclass?)
 * what should be the format of results?
 I think it would make sense to provide a single record retrieval (by primary 
 key), all records, and records within a range. This incidentally matches well 
 the capabilities of the Gora Query class :)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON

2010-11-04 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12928355#action_12928355
 ] 

Andrzej Bialecki  commented on NUTCH-932:
-

Examples (with the db equivalent to the one in db.formatted.gz):

{code}
$ curl -s 
'http://localhost:8192/nutch/db?fields=url&end=http://www.freebsd.org/&start=http://www.egothor.org/' |
 ./json_pp
[
  {
    "url": "http://www.egothor.org/"
  }, 
  {
    "url": "http://www.freebsd.org/"
  }
]
{code}

{code}
$ curl -s 
'http://localhost:8192/nutch/db?fields=url,outlinks,markers,protocolStatus,parseStatus,contentType&start=http://www.getopt.org/&end=http://www.getopt.org/' |
 ./json_pp
[
  {
    "contentType": "text/html", 
    "url": "http://www.getopt.org/", 
    "markers": {
      "_updmrk_": "1288890451-1134865895"
    }, 
    "parseStatus": "success/ok (1/0), args=[]", 
    "protocolStatus": "SUCCESS, args=[]", 
    "outlinks": {
      "http://www.getopt.org/luke/": "Luke", 
      "http://www.getopt.org/ecimf/contrib/ONTO/REA": "REA Ontology page", 
      "http://www.getopt.org/CV.pdf": "CV here", 
      "http://www.getopt.org/utils/build/api": "API", 
      "http://svn.apache.org/viewvc/hadoop/hbase/trunk/src/java/org/apache/hadoop/hbase/util/JenkinsHash.java": "available here", 
      "http://www.getopt.org/murmur/MurmurHash.java": "MurmurHash.java", 
      "http://www.ebxml.org/": "ebXML / ebTWG", 
      "http://www.freebsd.org/": "FreeBSD", 
      "http://www.getopt.org/luke/webstart.html": "Launch with Java WebStart", 
      "http://www.freebsd.org/%7Epicobsd": "PicoBSD", 
      "http://home.comcast.net/~bretm/hash/6.html": "this discussion", 
      "http://protege.stanford.edu/": "Protege", 
      "http://jakarta.apache.org/lucene": "Lucene", 
      "http://www.getopt.org/ecimf/contrib/ONTO/ebxml": "ebXML Ontology", 
      "http://www.getopt.org/ecimf/": "here", 
      "http://www.isthe.com/chongo/tech/comp/fnv/": "his website", 
      "http://www.getopt.org/stempel/index.html": "Stempel", 
      "http://www.sigram.com/": "SIGRAM", 
      "http://www.egothor.org/": "Egothor", 
      "http://thinlet.sourceforge.net/": "Thinlet", 
      "http://www.getopt.org/utils/dist/utils-1.0.jar": "binary", 
      "http://www.ecimf.org/": "ECIMF"
    }
  }
]
{code}


 Bulk REST API to retrieve crawl results as JSON
 ---

 Key: NUTCH-932
 URL: https://issues.apache.org/jira/browse/NUTCH-932
 Project: Nutch
  Issue Type: New Feature
  Components: REST_api
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: db.formatted.gz, NUTCH-932.patch, NUTCH-932.patch


 It would be useful to be able to retrieve results of a crawl as JSON. There 
 are a few things that need to be discussed:
 * how to return bulk results using Restlet (WritableRepresentation subclass?)
 * what should be the format of results?
 I think it would make sense to provide a single record retrieval (by primary 
 key), all records, and records within a range. This incidentally matches well 
 the capabilities of the Gora Query class :)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON

2010-11-04 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-932:


Attachment: NUTCH-932.patch

Updated patch - this recognizes now URL parameters such as fields, start/end 
keys, batch and crawl id.

 Bulk REST API to retrieve crawl results as JSON
 ---

 Key: NUTCH-932
 URL: https://issues.apache.org/jira/browse/NUTCH-932
 Project: Nutch
  Issue Type: New Feature
  Components: REST_api
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: db.formatted.gz, NUTCH-932.patch, NUTCH-932.patch


 It would be useful to be able to retrieve results of a crawl as JSON. There 
 are a few things that need to be discussed:
 * how to return bulk results using Restlet (WritableRepresentation subclass?)
 * what should be the format of results?
 I think it would make sense to provide a single record retrieval (by primary 
 key), all records, and records within a range. This incidentally matches well 
 the capabilities of the Gora Query class :)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-931) Simple admin API to fetch status and stop the service

2010-10-29 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-931.
-

Resolution: Fixed

Committed in rev. 1028736 with some changes.

 Simple admin API to fetch status and stop the service
 -

 Key: NUTCH-931
 URL: https://issues.apache.org/jira/browse/NUTCH-931
 Project: Nutch
  Issue Type: Improvement
  Components: REST_api
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 2.0

 Attachments: NUTCH-931.patch


 REST API needs a simple info / stats service and the ability to shutdown the 
 server.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-880) REST API for Nutch

2010-10-28 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-880:


Summary: REST API for Nutch  (was: REST API (and webapp) for Nutch)

The webapp part is tracked now in NUTCH-929.

 REST API for Nutch
 --

 Key: NUTCH-880
 URL: https://issues.apache.org/jira/browse/NUTCH-880
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: API-2.patch, API.patch


 This issue is for discussing a REST-style API for accessing Nutch.
 Here's an initial idea:
 * I propose to use org.restlet for handling requests and returning 
 JSON/XML/whatever responses.
 * hook up all regular tools so that they can be driven via this API. This 
 would have to be an async API, since all Nutch operations take long time to 
 execute. It follows then that we need to be able also to list running 
 operations, retrieve their current status, and possibly 
 abort/cancel/stop/suspend/resume/...? This also means that we would have to 
 potentially create & manage many threads in a servlet - AFAIK this is frowned 
 upon by J2EE purists...
 * package this in a webapp (that includes all deps, essentially nutch.job 
 content), with the restlet servlet as an entry point.
 Open issues:
 * how to implement the reading of crawl results via this API
 * should we manage only crawls that use a single configuration per webapp, or 
 should we have a notion of crawl contexts (sets of crawl configs) with CRUD 
 ops on them? this would be nice, because it would allow managing of several 
 different crawls, with different configs, in a single webapp - but it 
 complicates the implementation a lot.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-880) REST API for Nutch

2010-10-28 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-880.
-

   Resolution: Fixed
Fix Version/s: 2.0

Committed in rev. 1028235. The webapp part of this issue is tracked now in 
NUTCH-929.

 REST API for Nutch
 --

 Key: NUTCH-880
 URL: https://issues.apache.org/jira/browse/NUTCH-880
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 2.0

 Attachments: API-2.patch, API.patch


 This issue is for discussing a REST-style API for accessing Nutch.
 Here's an initial idea:
 * I propose to use org.restlet for handling requests and returning 
 JSON/XML/whatever responses.
 * hook up all regular tools so that they can be driven via this API. This 
 would have to be an async API, since all Nutch operations take long time to 
 execute. It follows then that we need to be able also to list running 
 operations, retrieve their current status, and possibly 
 abort/cancel/stop/suspend/resume/...? This also means that we would have to 
 potentially create & manage many threads in a servlet - AFAIK this is frowned 
 upon by J2EE purists...
 * package this in a webapp (that includes all deps, essentially nutch.job 
 content), with the restlet servlet as an entry point.
 Open issues:
 * how to implement the reading of crawl results via this API
 * should we manage only crawls that use a single configuration per webapp, or 
 should we have a notion of crawl contexts (sets of crawl configs) with CRUD 
 ops on them? this would be nice, because it would allow managing of several 
 different crawls, with different configs, in a single webapp - but it 
 complicates the implementation a lot.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-930) Remove remaining dependencies on Lucene API

2010-10-28 Thread Andrzej Bialecki (JIRA)
Remove remaining dependencies on Lucene API
---

 Key: NUTCH-930
 URL: https://issues.apache.org/jira/browse/NUTCH-930
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 


Nutch doesn't use Lucene API anymore, all indexing happens via Lucene-agnostic 
SolrJ API. The only place where we still use a minor part of Lucene is in 
index-basic, and that use (DateTools) can be easily replaced.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-930) Remove remaining dependencies on Lucene API

2010-10-28 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-930:


Attachment: NUTCH-930.patch

Patch to fix the issue. I'll commit this shortly.

 Remove remaining dependencies on Lucene API
 ---

 Key: NUTCH-930
 URL: https://issues.apache.org/jira/browse/NUTCH-930
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: NUTCH-930.patch


 Nutch doesn't use Lucene API anymore, all indexing happens via 
 Lucene-agnostic SolrJ API. The only place where we still use a minor part of 
 Lucene is in index-basic, and that use (DateTools) can be easily replaced.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-930) Remove remaining dependencies on Lucene API

2010-10-28 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-930.
-

   Resolution: Fixed
Fix Version/s: 2.0

Committed in rev. 1028474.

 Remove remaining dependencies on Lucene API
 ---

 Key: NUTCH-930
 URL: https://issues.apache.org/jira/browse/NUTCH-930
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 2.0

 Attachments: NUTCH-930.patch


 Nutch doesn't use Lucene API anymore, all indexing happens via 
 Lucene-agnostic SolrJ API. The only place where we still use a minor part of 
 Lucene is in index-basic, and that use (DateTools) can be easily replaced.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-931) Simple admin API to fetch status and stop the service

2010-10-28 Thread Andrzej Bialecki (JIRA)
Simple admin API to fetch status and stop the service
-

 Key: NUTCH-931
 URL: https://issues.apache.org/jira/browse/NUTCH-931
 Project: Nutch
  Issue Type: Improvement
  Components: REST_api
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 2.0


REST API needs a simple info / stats service and the ability to shutdown the 
server.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-926) Nutch follows wrong url in META http-equiv=refresh tag

2010-10-27 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925543#action_12925543
 ] 

Andrzej Bialecki  commented on NUTCH-926:
-

bq. Nutch continues to crawl the WRONG subdomains! But it should not do this!!
No need to shout, we hear you :)

Indeed, Nutch behavior when following redirects doesn't play well with the rule 
of ignoring external outlinks. Strictly speaking, redirects are not outlinks, 
but the silent assumption behind ignoreExternalOutlinks is that we crawl 
content only from that hostname.

And your patch would solve this particular issue. However, this is not as 
simple as it seems... My favorite example is www.ibm.com -> 
www8.ibm.com/index.html . If we apply your fix you won't be able to crawl 
www.ibm.com unless you inject all wwwNNN load-balanced hosts... so a simple 
equality of hostnames may not be sufficient. We have utilities to extract 
domain names, so we could compare domains but then we may mistreat 
money.cnn.com vs. weather.cnn.com ...
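The host-vs-domain tradeoff described above can be sketched as follows. This is a standalone illustration, not Nutch's actual URL utility API; the class and method names are hypothetical, and "registered domain" is simplified to the last two host labels (a real implementation would need a public-suffix list).

```java
// Sketch of the two redirect-scoping rules discussed above (hypothetical names).
public class RedirectScope {

    // Strict rule: redirect target must have exactly the same host.
    // Rejects www.ibm.com -> www8.ibm.com, which we may actually want to allow.
    public static boolean sameHost(String fromHost, String toHost) {
        return fromHost.equalsIgnoreCase(toHost);
    }

    // Naive "registered domain": last two labels of the host name.
    // (A real implementation needs a public-suffix list, e.g. for .co.uk.)
    static String lastTwoLabels(String host) {
        String[] parts = host.toLowerCase().split("\\.");
        int n = parts.length;
        return n <= 2 ? host.toLowerCase() : parts[n - 2] + "." + parts[n - 1];
    }

    // Looser rule: compare registered domains. Accepts www.ibm.com -> www8.ibm.com,
    // but also mistreats money.cnn.com vs. weather.cnn.com as "the same site".
    public static boolean sameDomain(String fromHost, String toHost) {
        return lastTwoLabels(fromHost).equals(lastTwoLabels(toHost));
    }
}
```

Neither rule is correct on its own, which is exactly the difficulty raised in the comment: the strict rule breaks load-balanced hosts, the loose rule merges distinct subdomain sites.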

 Nutch follows wrong url in META http-equiv=refresh tag
 -

 Key: NUTCH-926
 URL: https://issues.apache.org/jira/browse/NUTCH-926
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.2
 Environment: gnu/linux centOs
Reporter: Marco Novo
Priority: Critical
 Fix For: 1.3

 Attachments: ParseOutputFormat.java.patch


 We have nutch set to crawl a domain urllist and we want to fetch only passed 
 domains (hosts) not subdomains.
 So
 WWW.DOMAIN1.COM
 ..
 ..
 ..
 WWW.RIGHTDOMAIN.COM
 ..
 ..
 ..
 ..
 WWW.DOMAIN.COM
 We sets nutch to:
 NOT FOLLOW EXERNAL LINKS
 During crawling of WWW.RIGHTDOMAIN.COM
 if a page contains
 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
 <html>
 <head>
 <title></title>
 <META http-equiv="refresh" content="0;
 url=http://WRONG.RIGHTDOMAIN.COM">
 </head>
 <body>
 </body>
 </html>
 Nutch continues to crawl the WRONG subdomains! But it should not do this!!
 During crawling of WWW.RIGHTDOMAIN.COM
 if a page contains
 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
 <html>
 <head>
 <title></title>
 <META http-equiv="refresh" content="0;
 url=http://WWW.WRONGDOMAIN.COM">
 </head>
 <body>
 </body>
 </html>
 Nutch continues to crawl the WRONG domain! But it should not do this! If that 
 we will spider all the web
 We think the problem is in org.apache.nutch.parse ParseOutputFormat. We have 
 done a patch so we will attach it

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: ReviewBoard Instance

2010-10-26 Thread Andrzej Bialecki
On 2010-10-26 15:53, Mattmann, Chris A (388J) wrote:
 Hi Guys,
 
 Gav from infra@ set up a ReviewBoard instance for Apache [1]. I've never
 used it before but I thought I'd request an account on it for Nutch [2]
 regardless, so if folks want to use it, they can.

Hmm, I may be missing something... but what's the point of using the
tool in our JIRA-based workflow? It looks to me like it duplicates at
least part of JIRA's functionality, and the remaining part is what we do
also in JIRA by convention...


-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



[jira] Commented: (NUTCH-913) Nutch should use new namespace for Gora

2010-10-25 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12924659#action_12924659
 ] 

Andrzej Bialecki  commented on NUTCH-913:
-

+1, let's commit it -  I want to start playing with GORA-9, and that patch is 
in the org.apache namespace...

 Nutch should use new namespace for Gora
 ---

 Key: NUTCH-913
 URL: https://issues.apache.org/jira/browse/NUTCH-913
 Project: Nutch
  Issue Type: Bug
  Components: storage
Reporter: Doğacan Güney
Assignee: Doğacan Güney
 Fix For: 2.0

 Attachments: NUTCH-913_v1.patch, NUTCH-913_v2.patch


 Gora is in Apache Incubator now (Yey!). We recently changed Gora's namespace 
 from org.gora to org.apache.gora. This means nutch should use the new 
 namespace otherwise it won't compile with newer builds of Gora.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-923) Multilingual support for Solr-index-mapping

2010-10-23 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12924154#action_12924154
 ] 

Andrzej Bialecki  commented on NUTCH-923:
-

This doesn't solve the problem of potentially unbounded number of fields. 
Compliance is one thing, and you can clean up field names from invalid 
characters, but sanity is another thing - if you have {{title_*}} in your Solr 
schema then theoretically you are allowed to create an unlimited number of fields 
with this prefix - Solr won't complain.

 Multilingual support for Solr-index-mapping
 ---

 Key: NUTCH-923
 URL: https://issues.apache.org/jira/browse/NUTCH-923
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.2
Reporter: Matthias Agethle
Assignee: Markus Jelsma
Priority: Minor

 It would be useful to extend the mapping possibilities when indexing to Solr.
 One useful feature would be to use the detected language of the HTML page 
 (for example via the language-identifier plugin) and send the content to 
 corresponding language-aware Solr fields.
 The mapping file could be as follows:
 <field dest="lang" source="lang"/>
 <field dest="title_${lang}" source="title"/>
 so that the title field gets mapped to title_en for English pages and 
 title_fr for French pages.
 What do you think? Could this be useful also to others?
 Or are there already other solutions out there?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-924) Static field in solr mapping

2010-10-22 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923845#action_12923845
 ] 

Andrzej Bialecki  commented on NUTCH-924:
-

The functionality is useful, +1. But the patch has formatting errors. Please 
fix them before committing.

The same functionality should be added to trunk, too.

 Static field in solr mapping
 

 Key: NUTCH-924
 URL: https://issues.apache.org/jira/browse/NUTCH-924
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.3
Reporter: David Stuart
Assignee: Markus Jelsma
 Fix For: 1.3

 Attachments: nutch_1.3_static_field.patch

   Original Estimate: 0h
  Remaining Estimate: 0h

 Provide the facility to pass static data defined in solrindex-mapping.xml to 
 solr during the mapping process.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-923) Multilingual support for Solr-index-mapping

2010-10-22 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923896#action_12923896
 ] 

Andrzej Bialecki  commented on NUTCH-923:
-

This sounds useful, though the implementation needs to keep the following in 
mind:
* you _assume_ that the lang field will have a nice predictable value, but 
unless you sanitize the values you can't assume anything... example: one page I 
saw had a language metadata set to a random string 8kB long with various 
control chars and '\0'-s.

* again, if you don't sanitize and control the total number of unique values in 
the source field, you could end up with a number of fields approaching 
infinity, and Solr would melt down...
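The sanitization step argued for above can be sketched like this. This is an illustrative stand-alone helper, not part of Nutch or the language-identifier plugin; the class name and the whitelist contents are assumptions (a real deployment would enumerate exactly the languages its Solr schema defines fields for).

```java
import java.util.Set;

// Sketch: reduce an arbitrary (possibly garbage) language value to a bounded
// set, so the number of title_<lang> fields in Solr stays finite.
public class LangSanitizer {

    // Hypothetical whitelist of languages the Solr schema actually supports.
    private static final Set<String> KNOWN = Set.of("en", "fr", "de", "it", "es");

    public static String sanitize(String raw) {
        if (raw == null) return "unknown";
        String code = raw.trim().toLowerCase();
        // Keep only the primary subtag, e.g. "en-US" -> "en".
        if (code.length() > 2) code = code.substring(0, 2);
        // Anything outside the whitelist (8 kB of control chars included)
        // collapses to a single bucket instead of spawning a new field.
        return KNOWN.contains(code) ? code : "unknown";
    }
}
```

With this in front of the mapping, `title_${lang}` can expand to at most six distinct field names, no matter what the pages claim.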

 Multilingual support for Solr-index-mapping
 ---

 Key: NUTCH-923
 URL: https://issues.apache.org/jira/browse/NUTCH-923
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.2
Reporter: Matthias Agethle
Assignee: Markus Jelsma
Priority: Minor

 It would be useful to extend the mapping possibilities when indexing to Solr.
 One useful feature would be to use the detected language of the HTML page 
 (for example via the language-identifier plugin) and send the content to 
 corresponding language-aware Solr fields.
 The mapping file could be as follows:
 <field dest="lang" source="lang"/>
 <field dest="title_${lang}" source="title"/>
 so that the title field gets mapped to title_en for English pages and 
 title_fr for French pages.
 What do you think? Could this be useful also to others?
 Or are there already other solutions out there?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Build failed in Hudson: Nutch-trunk #1280

2010-10-19 Thread Andrzej Bialecki
On 2010-10-19 06:01, Apache Hudson Server wrote:

 [Nutch-trunk] $ /bin/bash -xe /tmp/hudson7277994413075810777.sh
 + 
 PATH=/home/hudson/tools/java/latest1.6/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/ucb:/usr/local/bin:/usr/bin:/usr/sfw/bin:/usr/sfw/sbin:/opt/sfw/bin:/opt/sfw/sbin:/opt/SUNWspro/bin:/usr/X/bin:/usr/ucb:/usr/sbin:/usr/ccs/bin
 + export ANT_HOME=/export/home/hudson/tools/ant/latest
 + ANT_HOME=/export/home/hudson/tools/ant/latest
 + export PATH ANT_HOME
 + cd trunk
 + /export/home/hudson/tools/ant/latest/bin/ant -Dversion=2010-10-19_04-00-41 
 -Dtest.junit.output.format=xml nightly
 /tmp/hudson7277994413075810777.sh: line 7: 
 /export/home/hudson/tools/ant/latest/bin/ant: No such file or directory

Do you know guys why the automated builds are failing? Looks like Ant is
not where the build script expects it to be...


-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



[jira] Updated: (NUTCH-921) Reduce dependency of Nutch on config files

2010-10-19 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-921:


Attachment: NUTCH-921.patch

Patch that implements reading config parameters from Configuration, and falls 
back to config files if Configuration properties are unspecified.
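The lookup order the patch describes (Configuration property first, config file only as a fallback) can be sketched as below. This is not the patch's actual code; `java.util.Properties` stands in for Hadoop's `Configuration`, and the class, key, and loader names are illustrative.

```java
import java.util.Properties;
import java.util.function.Supplier;

// Sketch of the NUTCH-921 lookup order: prefer an inline Configuration
// property, and only read the traditional config file when it is unset.
public class ConfigFirst {

    public static String getRules(Properties conf, String key,
                                  Supplier<String> fileLoader) {
        String inline = conf.getProperty(key);
        // Property set via API (e.g. an embedding application or a per-job
        // override) wins; otherwise fall back to the classpath file.
        return inline != null ? inline : fileLoader.get();
    }
}
```

This is what makes it possible to run many jobs with slightly different configurations without packing a different file set into each job jar.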

 Reduce dependency of Nutch on config files
 --

 Key: NUTCH-921
 URL: https://issues.apache.org/jira/browse/NUTCH-921
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 2.0

 Attachments: NUTCH-921.patch


 Currently many components in Nutch rely on reading their configuration from 
 files. These files need to be on the classpath (or packed into a job jar). 
 This is inconvenient if you want to manage configuration via API, e.g. when 
 embedding Nutch, or running many jobs with slightly different configurations.
 This issue tracks the improvement to make various components read their 
 config directly from Configuration properties.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-913) Nutch should use new namespace for Gora

2010-10-13 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920610#action_12920610
 ] 

Andrzej Bialecki  commented on NUTCH-913:
-

There are formatting issues in DomainStatistics.java - the file uses literal 
tabs, which we frown upon, but the patch introduces double-space indent in the 
changed lines. As ugly as it sounds I think this should be changed into tabs, 
and then reformatted in another commit.

Other than that, +1, go for it.

 Nutch should use new namespace for Gora
 ---

 Key: NUTCH-913
 URL: https://issues.apache.org/jira/browse/NUTCH-913
 Project: Nutch
  Issue Type: Bug
  Components: storage
Reporter: Doğacan Güney
Assignee: Doğacan Güney
 Fix For: 2.0

 Attachments: NUTCH-913_v1.patch


 Gora is in Apache Incubator now (Yey!). We recently changed Gora's namespace 
 from org.gora to org.apache.gora. This means nutch should use the new 
 namespace otherwise it won't compile with newer builds of Gora.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls

2010-10-01 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916870#action_12916870
 ] 

Andrzej Bialecki  commented on NUTCH-907:
-

Hi Sertan,

Thanks for the patch, this looks very good! A few comments:

* I'm not good at naming things either... schemaId is a little bit cryptic 
though. If we didn't already use crawlId I would vote for that (and then rename 
crawlId to batchId or fetchId), as it is now... I don't know, maybe datasetId ..

* since we now create multiple datasets, we need somehow to manage them - i.e. 
list and delete at least (create is implicit). There is no such functionality 
in this patch, but this can be addressed also as a separate issue.

* IndexerMapReduce.createIndexJob: I think it would be useful to pass the 
datasetId as a Job property - this way indexing filter plugins can use this 
property to populate NutchDocument fields if needed. FWIW, this may be a good 
idea to do in other jobs as well...

 DataStore API doesn't support multiple storage areas for multiple disjoint 
 crawls
 -

 Key: NUTCH-907
 URL: https://issues.apache.org/jira/browse/NUTCH-907
 Project: Nutch
  Issue Type: Bug
Reporter: Andrzej Bialecki 
 Fix For: 2.0

 Attachments: NUTCH-907.patch


 In Nutch 1.x it was possible to easily select a set of crawl data (crawldb, 
 page data, linkdb, etc) by specifying a path where the data was stored. This 
 enabled users to run several disjoint crawls with different configs, but 
 still using the same storage medium, just under different paths.
 This is not possible now because there is a 1:1 mapping between a specific 
 DataStore instance and a set of crawl data.
 In order to support this functionality the Gora API should be extended so 
 that it can create stores (and data tables in the underlying storage) that 
 use arbitrary prefixes to identify the particular crawl dataset. Then the 
 Nutch API should be extended to allow passing this crawlId value to select 
 one of possibly many existing crawl datasets.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-882) Design a Host table in GORA

2010-10-01 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916874#action_12916874
 ] 

Andrzej Bialecki  commented on NUTCH-882:
-

Doğacan, I missed your previous comment... the issue with partial bloom filters 
is usually solved by having each task store its own filter - this worked well for 
MapFile-s because they consisted of multiple parts, so then a Reader would open 
a part and a corresponding bloom filter.

Here it's more complicated, I agree... though this reminds me of the situation 
that is handled by DynamicBloomFilter: it's basically a set of Bloom filters 
with a facade that hides this fact from the user. Here we could construct 
something similar, i.e. don't merge partial filters after closing the output, 
but instead when opening a Reader read all partial filters and pretend they are 
one.
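The "facade over partial filters" idea above can be sketched as a toy implementation. This is not Hadoop's `DynamicBloomFilter` and the hashing is deliberately simplistic; the class names are hypothetical. The point is only the facade behavior: a key added to any part is reported as present by the whole.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Toy facade over several partial Bloom filters: a membership query consults
// every part and reports a hit if any part matches, so callers see one filter.
public class BloomFacade {

    static class Part {
        private final BitSet bits = new BitSet(1024);

        void add(String key) {
            for (int seed = 1; seed <= 3; seed++)
                bits.set(Math.floorMod(key.hashCode() * seed, 1024));
        }

        boolean mightContain(String key) {
            for (int seed = 1; seed <= 3; seed++)
                if (!bits.get(Math.floorMod(key.hashCode() * seed, 1024)))
                    return false;
            return true;  // possibly a false positive, as with any Bloom filter
        }
    }

    private final List<Part> parts = new ArrayList<>();

    public void addPart(Part p) { parts.add(p); }

    // No false negatives: whichever task's partial filter saw the key answers.
    public boolean mightContain(String key) {
        for (Part p : parts)
            if (p.mightContain(key)) return true;
        return false;
    }
}
```

A Reader opening N per-task filters could wrap them this way instead of merging them on close.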

 Design a Host table in GORA
 ---

 Key: NUTCH-882
 URL: https://issues.apache.org/jira/browse/NUTCH-882
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 2.0
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 2.0

 Attachments: hostdb.patch, NUTCH-882-v1.patch


 Having a separate GORA table for storing information about hosts (and 
 domains?) would be very useful for : 
 * customising the behaviour of the fetching on a host basis e.g. number of 
 threads, min time between threads etc...
 * storing stats
 * keeping metadata and possibly propagate them to the webpages 
 * keeping a copy of the robots.txt and possibly use that later to filter the 
 webtable
 * store sitemaps files and update the webtable accordingly
 I'll try to come up with a GORA schema for such a host table but any comments 
 are of course already welcome 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-864) Fetcher generates entries with status 0

2010-10-01 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916912#action_12916912
 ] 

Andrzej Bialecki  commented on NUTCH-864:
-

I think the difficulty comes from the simplification in 2.x as compared to 1.x, 
in that we keep a single status per page. In 1.x a side-effect of having two 
locations with two statuses (one db status in crawldb and one fetch status 
in segments) was that we had more information in updatedb to act upon.

Now we should probably keep up to two statuses - one that reflects a temporary 
fetch status, as determined by fetcher, and a final (reconciled) status as 
determined by updatedb, based on the knowledge of not only plain fetch status 
and old status but also possible redirects. If I'm not mistaken, currently the 
status is immediately overwritten by the fetcher, even before we come to updatedb, 
hence the problem.
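The two-status scheme proposed above can be sketched as a reconciliation step. The enum values and mapping below are illustrative only (loosely modeled on the counters in the issue report), not Nutch's actual status model: the fetcher records a provisional fetch status, and updatedb derives the final db status from it instead of letting the fetcher overwrite the single stored status.

```java
// Sketch: keep a provisional fetch status separate from the final db status,
// and let updatedb reconcile them (hypothetical enums, not Nutch's real ones).
public class StatusReconcile {

    enum DbStatus { UNFETCHED, FETCHED, GONE, REDIR }
    enum FetchStatus { SUCCESS, TEMP_MOVED, MOVED, NOTFOUND, EXCEPTION }

    // updatedb's view: map the fetcher's provisional result (plus, in the real
    // system, redirect knowledge and the old db status) to a final db status,
    // so no page is left with a null/0 status.
    public static DbStatus reconcile(FetchStatus f) {
        switch (f) {
            case SUCCESS:    return DbStatus.FETCHED;
            case TEMP_MOVED:
            case MOVED:      return DbStatus.REDIR;
            case NOTFOUND:   return DbStatus.GONE;
            default:         return DbStatus.UNFETCHED;  // e.g. EXCEPTION -> retry later
        }
    }
}
```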

 Fetcher generates entries with status 0
 ---

 Key: NUTCH-864
 URL: https://issues.apache.org/jira/browse/NUTCH-864
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
 Environment: Gora with SQLBackend
 URL: https://svn.apache.org/repos/asf/nutch/branches/nutchbase
 Last Changed Rev: 980748
 Last Changed Date: 2010-07-30 14:19:52 +0200 (Fri, 30 Jul 2010)
Reporter: Julien Nioche
Assignee: Doğacan Güney
 Fix For: 2.0


 After a round of fetching which got the following protocol status :
 10/07/30 15:11:39 INFO mapred.JobClient: ACCESS_DENIED=2
 10/07/30 15:11:39 INFO mapred.JobClient: SUCCESS=1177
 10/07/30 15:11:39 INFO mapred.JobClient: GONE=3
 10/07/30 15:11:39 INFO mapred.JobClient: TEMP_MOVED=138
 10/07/30 15:11:39 INFO mapred.JobClient: EXCEPTION=93
 10/07/30 15:11:39 INFO mapred.JobClient: MOVED=521
 10/07/30 15:11:39 INFO mapred.JobClient: NOTFOUND=62
 I ran : ./nutch org.apache.nutch.crawl.WebTableReader -stats
 10/07/30 15:12:37 INFO crawl.WebTableReader: Statistics for WebTable: 
 10/07/30 15:12:37 INFO crawl.WebTableReader: TOTAL urls:  2690
 10/07/30 15:12:37 INFO crawl.WebTableReader: retry 0: 2690
 10/07/30 15:12:37 INFO crawl.WebTableReader: min score:   0.0
 10/07/30 15:12:37 INFO crawl.WebTableReader: avg score:   0.7587361
 10/07/30 15:12:37 INFO crawl.WebTableReader: max score:   1.0
 10/07/30 15:12:37 INFO crawl.WebTableReader: status 0 (null): 649
 10/07/30 15:12:37 INFO crawl.WebTableReader: status 2 (status_fetched):   
 1177 (SUCCESS=1177)
 10/07/30 15:12:37 INFO crawl.WebTableReader: status 3 (status_gone):  112 
 10/07/30 15:12:37 INFO crawl.WebTableReader: status 34 (status_retry):
 93 (EXCEPTION=93)
 10/07/30 15:12:37 INFO crawl.WebTableReader: status 4 (status_redir_temp):
 138  (TEMP_MOVED=138)
 10/07/30 15:12:37 INFO crawl.WebTableReader: status 5 (status_redir_perm):
 521 (MOVED=521)
 10/07/30 15:12:37 INFO crawl.WebTableReader: WebTable statistics: done
 There should not be any entries with status 0 (null)
 I will investigate a bit more...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: [VOTE] Apache Nutch 1.2 Release Candidate #4

2010-09-24 Thread Andrzej Bialecki

On 2010-09-24 04:38, Mattmann, Chris A (388J) wrote:

Hi Nutch PMC:

/nudge

Anyone get a chance to review this yet? I have some free cycles tomorrow
and would really think it’s cool if I could finally push out the 1.2 RC.


I had little time this week, but I'm testing it now... I should be done 
tomorrow.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [VOTE] Apache Nutch 1.2 Release Candidate #4

2010-09-24 Thread Andrzej Bialecki

On 2010-09-24 20:40, Mattmann, Chris A (388J) wrote:

Thanks Andrzej, appreciate it. I know you’ve been really vigilant with
the other RCs I’ve thrown up about testing and I appreciate it. Other
Nutch PMC’ers: just need one more VOTE. Help, please? :)


+1, all unit tests pass, and a test crawl + indexing to Solr went just fine.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



[jira] Commented: (NUTCH-880) REST API (and webapp) for Nutch

2010-09-21 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913118#action_12913118
 ] 

Andrzej Bialecki  commented on NUTCH-880:
-

bq. I think we can combine the approach you outlined in NUTCH-907 with this one.

I'm not sure... they are really not the same things - you can execute many 
crawls with different seed lists, but still using the same Configuration.

bq. What is CLASS ?

It's the same as bin/nutch fully.qualified.class.name, only here I require that 
it implements NutchTool.

bq. Btw, Andrzej, I will be happy to help out with the implementation if you 
want.

By all means - I didn't have time so far to progress beyond this patch...

 REST API (and webapp) for Nutch
 ---

 Key: NUTCH-880
 URL: https://issues.apache.org/jira/browse/NUTCH-880
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: API.patch


 This issue is for discussing a REST-style API for accessing Nutch.
 Here's an initial idea:
 * I propose to use org.restlet for handling requests and returning 
 JSON/XML/whatever responses.
 * hook up all regular tools so that they can be driven via this API. This 
 would have to be an async API, since all Nutch operations take long time to 
 execute. It follows then that we need to be able also to list running 
 operations, retrieve their current status, and possibly 
 abort/cancel/stop/suspend/resume/...? This also means that we would have to 
 potentially create & manage many threads in a servlet - AFAIK this is frowned 
 upon by J2EE purists...
 * package this in a webapp (that includes all deps, essentially nutch.job 
 content), with the restlet servlet as an entry point.
 Open issues:
 * how to implement the reading of crawl results via this API
 * should we manage only crawls that use a single configuration per webapp, or 
 should we have a notion of crawl contexts (sets of crawl configs) with CRUD 
 ops on them? this would be nice, because it would allow managing of several 
 different crawls, with different configs, in a single webapp - but it 
 complicates the implementation a lot.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-909) Add alternative search-provider to Nutch site

2010-09-20 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12912474#action_12912474
 ] 

Andrzej Bialecki  commented on NUTCH-909:
-

bq. It might be better to see the message "Search with Apache Solr" (as on the 
Tika site).

Yes, let's make this uniform.

 Add alternative search-provider to Nutch site
 -

 Key: NUTCH-909
 URL: https://issues.apache.org/jira/browse/NUTCH-909
 Project: Nutch
  Issue Type: Improvement
  Components: documentation
Reporter: Alex Baranau
Priority: Minor
 Attachments: NUTCH-909.patch


 Add an additional search provider (besides the existing Lucid Find): search-lucene.com. 
 Initiated in discussion: http://search-lucene.com/m/2suCr1UnDfF1
 According to Andrzej's suggestion, when preparing the patch let's follow the 
 same rationales as those in TIKA-488, since they are applicable here too, so 
 please refer to that issue for more insight on implementation details.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (NUTCH-862) HttpClient null pointer exception

2010-09-17 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  reassigned NUTCH-862:
---

Assignee: Andrzej Bialecki 

 HttpClient null pointer exception
 -

 Key: NUTCH-862
 URL: https://issues.apache.org/jira/browse/NUTCH-862
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
 Environment: linux, java 6
Reporter: Sebastian Nagel
Assignee: Andrzej Bialecki 
Priority: Minor
 Attachments: NUTCH-862.patch


 When re-fetching a document (a continued crawl) HttpClient throws an null 
 pointer exception causing the document to be emptied:
 2010-07-27 12:45:09,199 INFO  fetcher.Fetcher - fetching 
 http://localhost/doc/selfhtml/html/index.htm
 2010-07-27 12:45:09,203 ERROR httpclient.Http - java.lang.NullPointerException
 2010-07-27 12:45:09,204 ERROR httpclient.Http - at 
 org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:138)
 2010-07-27 12:45:09,204 ERROR httpclient.Http - at 
 org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:154)
 2010-07-27 12:45:09,204 ERROR httpclient.Http - at 
 org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:220)
 2010-07-27 12:45:09,204 ERROR httpclient.Http - at 
 org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:537)
 2010-07-27 12:45:09,204 INFO  fetcher.Fetcher - fetch of 
 http://localhost/doc/selfhtml/html/index.htm failed with: 
 java.lang.NullPointerException
 Because the document is re-fetched the server answers 304 (not modified):
 127.0.0.1 - - [27/Jul/2010:12:45:09 +0200] "GET /doc/selfhtml/html/index.htm 
 HTTP/1.0" 304 174 "-" "Nutch-1.0"
 No content is sent in this case (empty http body).
 Index: 
 trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java
 ===
 --- 
 trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java
 (revision 979647)
 +++ 
 trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java
 (working copy)
 @@ -134,7 +134,8 @@
  if (code == 200) throw new IOException(e.toString());
  // for codes other than 200 OK, we are fine with empty content
} finally {
 -in.close();
 +if (in != null)
 +  in.close();
  get.abort();
}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-906) Nutch OpenSearch sometimes raises DOMExceptions due to Lucene column names not being valid XML tag names

2010-09-17 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-906.
-

Fix Version/s: 1.2
   Resolution: Fixed

Fixed in rev. 998261. Thanks!

 Nutch OpenSearch sometimes raises DOMExceptions due to Lucene column names 
 not being valid XML tag names
 

 Key: NUTCH-906
 URL: https://issues.apache.org/jira/browse/NUTCH-906
 Project: Nutch
  Issue Type: Bug
  Components: web gui
Affects Versions: 1.1
 Environment: Debian GNU/Linux 64-bit
Reporter: Asheesh Laroia
Assignee: Andrzej Bialecki 
 Fix For: 1.2

 Attachments: 
 0001-OpenSearch-If-a-Lucene-column-name-begins-with-a-num.patch

   Original Estimate: 0.33h
  Remaining Estimate: 0.33h

 The Nutch FAQ explains that OpenSearch includes all fields that are 
 available at search result time. However, some Lucene column names can start 
 with numbers. Valid XML tags cannot. If Nutch is generating OpenSearch 
 results for a document with a Lucene document column whose name starts with 
 numbers, the underlying Xerces library throws this exception: 
 org.w3c.dom.DOMException: INVALID_CHARACTER_ERR: An invalid or illegal XML 
 character is specified. 
 So I have written a patch that tests strings before they are used to generate 
 tags within OpenSearch.
 I hope you merge this, or a better version of the patch!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls

2010-09-16 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910109#action_12910109
 ] 

Andrzej Bialecki  commented on NUTCH-907:
-

That's very good news - in that case I'm fine with the Gora API as it is now, 
we should change Nutch to make use of this functionality.

 DataStore API doesn't support multiple storage areas for multiple disjoint 
 crawls
 -

 Key: NUTCH-907
 URL: https://issues.apache.org/jira/browse/NUTCH-907
 Project: Nutch
  Issue Type: Bug
Reporter: Andrzej Bialecki 
 Fix For: 2.0


 In Nutch 1.x it was possible to easily select a set of crawl data (crawldb, 
 page data, linkdb, etc) by specifying a path where the data was stored. This 
 enabled users to run several disjoint crawls with different configs, but 
 still using the same storage medium, just under different paths.
 This is not possible now because there is a 1:1 mapping between a specific 
 DataStore instance and a set of crawl data.
 In order to support this functionality the Gora API should be extended so 
 that it can create stores (and data tables in the underlying storage) that 
 use arbitrary prefixes to identify the particular crawl dataset. Then the 
 Nutch API should be extended to allow passing this crawlId value to select 
 one of possibly many existing crawl datasets.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-880) REST API (and webapp) for Nutch

2010-09-16 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-880:


Attachment: API.patch

Initial patch for discussion. This is a work in progress, so only some 
functionality is implemented, and even less than that is actually working ;)

I would appreciate a review and comments.

 REST API (and webapp) for Nutch
 ---

 Key: NUTCH-880
 URL: https://issues.apache.org/jira/browse/NUTCH-880
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: API.patch


 This issue is for discussing a REST-style API for accessing Nutch.
 Here's an initial idea:
 * I propose to use org.restlet for handling requests and returning 
 JSON/XML/whatever responses.
 * hook up all regular tools so that they can be driven via this API. This 
 would have to be an async API, since all Nutch operations take a long time to 
 execute. It follows that we also need to be able to list running 
 operations, retrieve their current status, and possibly 
 abort/cancel/stop/suspend/resume/...? This also means that we would have to 
 potentially create & manage many threads in a servlet - AFAIK this is frowned 
 upon by J2EE purists...
 * package this in a webapp (that includes all deps, essentially nutch.job 
 content), with the restlet servlet as an entry point.
 Open issues:
 * how to implement the reading of crawl results via this API
 * should we manage only crawls that use a single configuration per webapp, or 
 should we have a notion of crawl contexts (sets of crawl configs) with CRUD 
 ops on them? this would be nice, because it would allow managing of several 
 different crawls, with different configs, in a single webapp - but it 
 complicates the implementation a lot.
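To make the async idea concrete, here is a minimal sketch in plain Java (deliberately avoiding any Restlet dependency; JobManager and all of its method names are hypothetical, not part of the patch) of the kind of job registry such a servlet could delegate to: submit() returns a job id immediately, and status/cancel operate on that id.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.*;

// Hypothetical sketch: tracks long-running Nutch operations so a REST
// layer can list them, poll their status, and cancel them asynchronously.
public class JobManager {
    public enum State { RUNNING, DONE, FAILED, CANCELLED }

    private final ExecutorService pool = Executors.newCachedThreadPool();
    private final Map<String, Future<?>> jobs = new ConcurrentHashMap<>();
    private final Map<String, State> states = new ConcurrentHashMap<>();

    // Submit a long-running task; returns immediately with a job id.
    public String submit(Runnable task) {
        String id = UUID.randomUUID().toString();
        states.put(id, State.RUNNING);
        jobs.put(id, pool.submit(() -> {
            try {
                task.run();
                // only flips RUNNING -> DONE; a cancelled job stays CANCELLED
                states.replace(id, State.RUNNING, State.DONE);
            } catch (Exception e) {
                states.replace(id, State.RUNNING, State.FAILED);
            }
        }));
        return id;
    }

    public State status(String id) { return states.get(id); }

    // Best-effort cancel; interrupts the worker thread.
    public boolean cancel(String id) {
        Future<?> f = jobs.get(id);
        if (f != null && f.cancel(true)) {
            states.put(id, State.CANCELLED);
            return true;
        }
        return false;
    }

    public void shutdown() { pool.shutdownNow(); }
}
```

This sidesteps the "threads in a servlet" objection only partially: the pool still lives in the webapp, which is exactly the J2EE concern raised above.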

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls

2010-09-15 Thread Andrzej Bialecki (JIRA)
DataStore API doesn't support multiple storage areas for multiple disjoint 
crawls
-

 Key: NUTCH-907
 URL: https://issues.apache.org/jira/browse/NUTCH-907
 Project: Nutch
  Issue Type: Bug
Reporter: Andrzej Bialecki 
 Fix For: 2.0


In Nutch 1.x it was possible to easily select a set of crawl data (crawldb, 
page data, linkdb, etc) by specifying a path where the data was stored. This 
enabled users to run several disjoint crawls with different configs, but still 
using the same storage medium, just under different paths.

This is not possible now because there is a 1:1 mapping between a specific 
DataStore instance and a set of crawl data.

In order to support this functionality the Gora API should be extended so that 
it can create stores (and data tables in the underlying storage) that use 
arbitrary prefixes to identify the particular crawl dataset. Then the Nutch API 
should be extended to allow passing this crawlId value to select one of 
possibly many existing crawl datasets.
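A minimal sketch of what the proposed extension might look like, assuming a hypothetical CrawlStoreFactory class (not part of the Gora API); the point is only that a crawlId prefix maps one logical schema onto many physical tables, analogous to the per-path crawl dirs of Nutch 1.x:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the proposed extension: a factory that derives
// per-dataset table names so disjoint crawls share the same backend but
// live in separately-named tables.
public class CrawlStoreFactory {
    private final Map<String, String> tables = new ConcurrentHashMap<>();

    // Derive the physical table name from the logical schema name and an
    // optional crawlId prefix (null/empty selects the default dataset).
    public String tableName(String crawlId, String schema) {
        String name = (crawlId == null || crawlId.isEmpty())
            ? schema : crawlId + "_" + schema;
        tables.putIfAbsent(name, name);  // stands in for "create table if missing"
        return name;
    }
}
```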

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-882) Design a Host table in GORA

2010-09-15 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12909757#action_12909757
 ] 

Andrzej Bialecki  commented on NUTCH-882:
-

+1 to NutchContext. See also NUTCH-907 because the changes required in Gora API 
will likely make this task easier (once implemented ;) ).

 Design a Host table in GORA
 ---

 Key: NUTCH-882
 URL: https://issues.apache.org/jira/browse/NUTCH-882
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 2.0
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 2.0

 Attachments: NUTCH-882-v1.patch


 Having a separate GORA table for storing information about hosts (and 
 domains?) would be very useful for: 
 * customising the behaviour of the fetching on a host basis e.g. number of 
 threads, min time between threads etc...
 * storing stats
 * keeping metadata and possibly propagate them to the webpages 
 * keeping a copy of the robots.txt and possibly use that later to filter the 
 webtable
 * store sitemaps files and update the webtable accordingly
 I'll try to come up with a GORA schema for such a host table, but any comments 
 are of course already welcome.
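To make the schema discussion concrete, here is a hedged first cut at such a host table as a Gora-style Avro record; every field name below is an illustrative assumption, not a committed design:

```json
{
  "name": "Host",
  "type": "record",
  "namespace": "org.apache.nutch.storage",
  "fields": [
    {"name": "metadata",   "type": {"type": "map", "values": "bytes"}},
    {"name": "fetchDelay", "type": "long"},
    {"name": "maxThreads", "type": "int"},
    {"name": "robotsTxt",  "type": ["null", "string"]},
    {"name": "sitemaps",   "type": {"type": "array", "items": "string"}}
  ]
}
```

The metadata map would cover stats and per-host settings to propagate to webpages; robotsTxt and sitemaps correspond to the last two bullets above.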

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


