[jira] [Commented] (NUTCH-1708) use same id when indexing and deleting redirects

2014-07-29 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14078012#comment-14078012
 ] 

Hudson commented on NUTCH-1708:
---

SUCCESS: Integrated in Nutch-nutchgora #1102 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1102/])
NUTCH-1708 use same id when indexing and deleting redirects (snagel: 
http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1614375)
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/conf/schema.xml
* /nutch/branches/2.x/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java
* /nutch/branches/2.x/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/schema-solr4.xml
* /nutch/trunk/conf/schema.xml
* /nutch/trunk/conf/solrindex-mapping.xml
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java
* /nutch/trunk/src/plugin/indexer-dummy/src/java/org/apache/nutch/indexwriter/dummy/DummyIndexWriter.java
* /nutch/trunk/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java


 use same id when indexing and deleting redirects
 

 Key: NUTCH-1708
 URL: https://issues.apache.org/jira/browse/NUTCH-1708
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.7
Reporter: Sebastian Nagel
 Fix For: 2.3, 1.9

 Attachments: NUTCH-1708-2x-v1.patch, NUTCH-1708-trunk-v1.patch, 
 NUTCH-1708-trunk-v2.patch


 Redirect targets are indexed using representative URL
 * in Fetcher repr URL is determined by URLUtil.chooseRepr() and stored in 
 CrawlDatum (CrawlDb). Repr URL is either source or target URL of the redirect 
 pair.
 * NutchField url is filled by basic indexing filter with repr URL
 * id field used as unique key is filled from url per solrindex-mapping.xml
 Deletion of redirects is done in IndexerMapReduce.reduce() by key, which is 
 the URL of the redirect source. If the source URL is chosen as the repr URL, 
 the redirect target is indexed under the same id that the deletion uses, so 
 the target may get erroneously deleted.
 Test crawl with seed {{http://wiki.apache.org/nutch}}, which redirects to 
 {{http://wiki.apache.org/nutch/}}. DummyIndexWriter (NUTCH-1707) indicates 
 that the same URL is deleted and added:
 {code}
 delete  http://wiki.apache.org/nutch
 add http://wiki.apache.org/nutch
 {code}
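
 The id collision described above can be reproduced with a toy model. This is 
 only a minimal sketch, not Nutch code: the class {{RedirectIdSketch}}, the 
 method {{run}} and the map standing in for the index are hypothetical, and it 
 assumes that the repr URL chosen is the redirect source and that the delete 
 reaches the index after the add.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of the id mismatch; none of these names exist in Nutch. */
public class RedirectIdSketch {

    public static final String SOURCE = "http://wiki.apache.org/nutch";
    public static final String TARGET = "http://wiki.apache.org/nutch/";

    /**
     * Simulates one indexing run over the redirect pair.
     * The map stands in for the index: unique key (id) -> stored 'url' field.
     */
    public static Map<String, String> run(boolean idFromReprUrl) {
        // Assumption: the repr URL chosen for the pair is the redirect source.
        String reprUrl = SOURCE;
        Map<String, String> index = new LinkedHashMap<>();

        // Add the redirect target. Field 'url' always holds the repr URL;
        // field 'id' is either copied from 'url' or set to the accessed URL.
        String id = idFromReprUrl ? reprUrl : TARGET;
        index.put(id, reprUrl);

        // Delete the redirect source by its CrawlDb key.
        // Assumption: the delete is applied after the add.
        index.remove(SOURCE);
        return index;
    }

    public static void main(String[] args) {
        System.out.println("id copied from repr URL: " + run(true));
        System.out.println("id is the accessed URL:  " + run(false));
    }
}
```

 With the id copied from the repr URL the toy index ends up empty, because 
 the delete removes the document that was just added. With the accessed URL 
 as id the target document survives and still carries the repr URL in its 
 'url' field.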



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (NUTCH-1708) use same id when indexing and deleting redirects

2014-07-29 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14078042#comment-14078042
 ] 

Hudson commented on NUTCH-1708:
---

SUCCESS: Integrated in Nutch-trunk #2724 (See 
[https://builds.apache.org/job/Nutch-trunk/2724/])
NUTCH-1708 use same id when indexing and deleting redirects (snagel: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1614375)
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/conf/schema.xml
* /nutch/branches/2.x/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java
* /nutch/branches/2.x/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/schema-solr4.xml
* /nutch/trunk/conf/schema.xml
* /nutch/trunk/conf/solrindex-mapping.xml
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java
* /nutch/trunk/src/plugin/indexer-dummy/src/java/org/apache/nutch/indexwriter/dummy/DummyIndexWriter.java
* /nutch/trunk/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java




[jira] [Commented] (NUTCH-1708) use same id when indexing and deleting redirects

2014-07-21 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068374#comment-14068374
 ] 

Julien Nioche commented on NUTCH-1708:
--

I like the approach and this would be the best way of solving the issue. +1 to 
commit.

Re: field 'orig' in 2.x: sounds like a duplicate of 'id' indeed. Let's remove 
it separately.

Thanks

 use same id when indexing and deleting redirects
 

 Key: NUTCH-1708
 URL: https://issues.apache.org/jira/browse/NUTCH-1708
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.7
Reporter: Sebastian Nagel
 Fix For: 1.9

 Attachments: NUTCH-1708-2x-v1.patch, NUTCH-1708-trunk-v1.patch




[jira] [Commented] (NUTCH-1708) use same id when indexing and deleting redirects

2014-04-14 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968270#comment-13968270
 ] 

Markus Jelsma commented on NUTCH-1708:
--

Yes, that seems reasonable, but we still need to get rid of the repr_url. To me 
it makes little sense to have such strange behaviour in index-basic.

 use same id when indexing and deleting redirects
 

 Key: NUTCH-1708
 URL: https://issues.apache.org/jira/browse/NUTCH-1708
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.7
Reporter: Sebastian Nagel



[jira] [Commented] (NUTCH-1708) use same id when indexing and deleting redirects

2014-04-14 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968762#comment-13968762
 ] 

Sebastian Nagel commented on NUTCH-1708:


??need to get rid of the repr_url??
Not necessarily:
# if we use for field 'id' the URL under which a document has been accessed 
(with any possible status), everything (indexing, updating, deletion) should 
work: those IDs are in sync with CrawlDb and can never appear twice.
# then we are free to fill the field 'url' with something prettier: the repr 
URL (usually shorter), with punycode decoded (no ugly {{xn--}}), showing 
letters instead of percent-encoded sequences, etc. Since field 'url' is 
tokenized, decoding the content makes more sense. If in doubt, we could make 
it configurable which of these denormalization steps are applied.
# finally, we achieve the same behaviour in 1.x and 2.x
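
The decoding steps mentioned in point 2 could look roughly like this. It is 
only a sketch built on the JDK classes {{java.net.IDN}} and 
{{java.net.URLDecoder}}; the class {{PrettyUrlSketch}} and its methods are 
hypothetical, not part of Nutch or of any attached patch, and the example host 
and path are made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.IDN;
import java.net.URLDecoder;

/** Hypothetical helpers for filling a human-readable 'url' field. */
public class PrettyUrlSketch {

    /** Turns a punycoded host name (xn--...) into its Unicode form. */
    public static String decodeHost(String asciiHost) {
        return IDN.toUnicode(asciiHost);
    }

    /**
     * Replaces percent-encoded UTF-8 sequences by the letters they encode.
     * Caveat: URLDecoder implements form decoding, so it also turns '+'
     * into a space and rejects malformed '%' sequences with an
     * IllegalArgumentException.
     */
    public static String decodePercent(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be available on every JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up example values, not taken from the issue.
        System.out.println(decodeHost("xn--mnchen-3ya.de"));
        System.out.println(decodePercent("/caf%C3%A9"));
    }
}
```

Only the 'url' field would be passed through such helpers. The 'id' field 
would keep the URL exactly as stored in CrawlDb, so that additions and 
deletions keep using the same key.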

 use same id when indexing and deleting redirects
 

 Key: NUTCH-1708
 URL: https://issues.apache.org/jira/browse/NUTCH-1708
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.7
Reporter: Sebastian Nagel



[jira] [Commented] (NUTCH-1708) use same id when indexing and deleting redirects

2014-03-28 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13951566#comment-13951566
 ] 

Sebastian Nagel commented on NUTCH-1708:


Hi [~markus17], another way would be to fill the fields 'id' and 'url' 
differently, as is done in 2.x. This approach would also allow decoding 
punycoded IDNs, see NUTCH-1321.

 use same id when indexing and deleting redirects
 

 Key: NUTCH-1708
 URL: https://issues.apache.org/jira/browse/NUTCH-1708
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.7
Reporter: Sebastian Nagel



[jira] [Commented] (NUTCH-1708) use same id when indexing and deleting redirects

2014-01-20 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13876288#comment-13876288
 ] 

Markus Jelsma commented on NUTCH-1708:
--

Hi Sebastian - we've had issues with that before and tracked it down to the 
representative URL being indexed in index-basic as well. We chose to remove 
that completely from our custom indexing filter; in my opinion a URL must be 
indexed by its real URL, not some representative URL. Indexing representative 
URLs also causes duplicates, which may or may not be removed by Nutch's new 
deduplicating code because the signatures are usually not the same.

 use same id when indexing and deleting redirects
 

 Key: NUTCH-1708
 URL: https://issues.apache.org/jira/browse/NUTCH-1708
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.7
Reporter: Sebastian Nagel




--
This message was sent by Atlassian JIRA
(v6.1.5#6160)