[jira] [Created] (NUTCH-1013) Migrate RegexURLNormalizer from Apache ORO to java.util.regex

2011-06-24 Thread Markus Jelsma (JIRA)
Migrate RegexURLNormalizer from Apache ORO to java.util.regex
-

 Key: NUTCH-1013
 URL: https://issues.apache.org/jira/browse/NUTCH-1013
 Project: Nutch
  Issue Type: Improvement
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.4, 2.0


Apache ORO uses old Perl 5-style regular expressions. Features such as the 
powerful lookbehind are not available. The project has become retired as well. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (NUTCH-1013) Migrate RegexURLNormalizer from Apache ORO to java.util.regex

2011-06-24 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1013:
-

Attachment: NUTCH-1013-1.4.patch

Patch for RegexURLNormalizer for 1.4. Seems to work fine.

 Migrate RegexURLNormalizer from Apache ORO to java.util.regex
 -

 Key: NUTCH-1013
 URL: https://issues.apache.org/jira/browse/NUTCH-1013
 Project: Nutch
  Issue Type: Improvement
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.4, 2.0

 Attachments: NUTCH-1013-1.4.patch


 Apache ORO uses old Perl 5-style regular expressions. Features such as the 
 powerful lookbehind are not available. The project has become retired as 
 well. 





[jira] [Commented] (NUTCH-1011) Normalize duplicate slashes in URL's

2011-06-24 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054438#comment-13054438
 ] 

Markus Jelsma commented on NUTCH-1011:
--

This normalizer works with NUTCH-1013.
 
{code}
<!-- removes duplicate slashes -->
<regex>
  <pattern>(?<!:)/{2,}</pattern>
  <substitution>/</substitution>
</regex>
{code}
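
For illustration, here is a minimal java.util.regex sketch of what the rule above does. It is not the code from the NUTCH-1013 patch; the class and method names are hypothetical. The negative lookbehind (?<!:) keeps the "//" after the scheme intact, and every other run of two or more slashes collapses to one.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Nutch.
public class DuplicateSlashSketch {
    // Same expression as the normalizer rule: two or more slashes
    // that are not directly preceded by a colon.
    private static final Pattern DUPLICATE_SLASHES = Pattern.compile("(?<!:)/{2,}");

    // Collapse each matching run of slashes into a single slash.
    static String normalize(String url) {
        return DUPLICATE_SLASHES.matcher(url).replaceAll("/");
    }

    public static void main(String[] args) {
        System.out.println(normalize("http://cocoon.apache.org///1.x/dynamic.html"));
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the expression for every URL.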

 Normalize duplicate slashes in URL's
 

 Key: NUTCH-1011
 URL: https://issues.apache.org/jira/browse/NUTCH-1011
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.4, 2.0
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Attachments: NUTCH-1011-all-3.patch


 Many websites produce faulty URL's with multiple slashes e.g. 
 http://cocoon.apache.org///1.x/dynamic.html
 This can be really nasty if the number of slashes varies, resulting in many 
 URL's actually pointing to the same page and generating new (unique) URL's to 
 the same or other duplicate pages.





[jira] [Issue Comment Edited] (NUTCH-1013) Migrate RegexURLNormalizer from Apache ORO to java.util.regex

2011-06-24 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054437#comment-13054437
 ] 

Markus Jelsma edited comment on NUTCH-1013 at 6/24/11 1:30 PM:
---

Patch for RegexURLNormalizer for 1.4. Seems to work fine. It also compiles 
against trunk. Unit tests pass as well. 

Are there objections? Things to take special care of? 

  was (Author: markus17):
Patch for RegexURLNormalizer for 1.4. Seems to work fine. It also compiles 
against trunk.
  
 Migrate RegexURLNormalizer from Apache ORO to java.util.regex
 -

 Key: NUTCH-1013
 URL: https://issues.apache.org/jira/browse/NUTCH-1013
 Project: Nutch
  Issue Type: Improvement
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.4, 2.0

 Attachments: NUTCH-1013-1.4.patch


 Apache ORO uses old Perl 5-style regular expressions. Features such as the 
 powerful lookbehind are not available. The project has become retired as 
 well. 





[jira] [Commented] (NUTCH-1013) Migrate RegexURLNormalizer from Apache ORO to java.util.regex

2011-06-24 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054471#comment-13054471
 ] 

Ken Krugler commented on NUTCH-1013:


No comment directly related to this patch, but URL normalization seems like a 
great component to move into crawler-commons, since all web crawlers need to do 
the same thing.

 Migrate RegexURLNormalizer from Apache ORO to java.util.regex
 -

 Key: NUTCH-1013
 URL: https://issues.apache.org/jira/browse/NUTCH-1013
 Project: Nutch
  Issue Type: Improvement
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.4, 2.0

 Attachments: NUTCH-1013-1.4.patch


 Apache ORO uses old Perl 5-style regular expressions. Features such as the 
 powerful lookbehind are not available. The project has become retired as 
 well. 





[jira] [Commented] (NUTCH-1012) Cannot handle illegal charset $charset

2011-06-24 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054475#comment-13054475
 ] 

Ken Krugler commented on NUTCH-1012:


Tika has code to try to resolve charset names (and handle common error cases) 
in a graceful manner. Nutch might want to use this code, or we could add a 
general wrapper to crawler-commons. See CharsetUtils in Tika.
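
As a rough sketch of the kind of wrapper meant here, using only java.nio.charset: this is neither Tika's CharsetUtils nor Nutch's EncodingDetector, and the class and method names are hypothetical. It treats an illegal or unsupported charset name as "no usable clue" instead of letting the exception abort the parse.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;

// Hypothetical helper, not part of Nutch, Tika or crawler-commons.
public class SafeCharsetSketch {
    // Returns the Charset for the given name, or null when the name is
    // missing, syntactically illegal (e.g. "$charset") or not supported.
    static Charset resolveOrNull(String name) {
        if (name == null) {
            return null;
        }
        try {
            return Charset.forName(name.trim());
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(resolveOrNull("$charset"));
        System.out.println(resolveOrNull("utf-8"));
    }
}
```

A caller could then fall back to a default encoding or to content-based detection whenever null comes back.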

 Cannot handle illegal charset $charset
 --

 Key: NUTCH-1012
 URL: https://issues.apache.org/jira/browse/NUTCH-1012
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.3
Reporter: Markus Jelsma
Priority: Minor
 Fix For: 1.4


 Pages returning:
 {code}
 Content-Type: text/html; charset=$charset
 {code}
 cause:
 {code}
 Error parsing: http://host/: failed(2,200): 
 java.nio.charset.IllegalCharsetNameException: $charset
 Found a TextHeaderAtom not followed by a TextBytesAtom or TextCharsAtom: 
 Followed by 3999
 ParseSegment: finished at 2011-06-24 01:14:54, elapsed: 00:01:12
 {code}
 Stack trace:
 {code}
 2011-06-24 01:14:23,442 WARN  parse.html - 
 java.nio.charset.IllegalCharsetNameException: $charset
 2011-06-24 01:14:23,442 WARN  parse.html - at 
 java.nio.charset.Charset.checkName(Charset.java:284)
 2011-06-24 01:14:23,442 WARN  parse.html - at 
 java.nio.charset.Charset.lookup2(Charset.java:458)
 2011-06-24 01:14:23,442 WARN  parse.html - at 
 java.nio.charset.Charset.lookup(Charset.java:437)
 2011-06-24 01:14:23,442 WARN  parse.html - at 
 java.nio.charset.Charset.isSupported(Charset.java:479)
 2011-06-24 01:14:23,442 WARN  parse.html - at 
 org.apache.nutch.util.EncodingDetector.resolveEncodingAlias(EncodingDetector.java:310)
 2011-06-24 01:14:23,442 WARN  parse.html - at 
 org.apache.nutch.util.EncodingDetector.addClue(EncodingDetector.java:201)
 2011-06-24 01:14:23,442 WARN  parse.html - at 
 org.apache.nutch.util.EncodingDetector.addClue(EncodingDetector.java:208)
 2011-06-24 01:14:23,442 WARN  parse.html - at 
 org.apache.nutch.util.EncodingDetector.autoDetectClues(EncodingDetector.java:193)
 2011-06-24 01:14:23,442 WARN  parse.html - at 
 org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:138)
 2011-06-24 01:14:23,442 WARN  parse.html - at 
 org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
 2011-06-24 01:14:23,443 WARN  parse.html - at 
 org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
 2011-06-24 01:14:23,443 WARN  parse.html - at 
 java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
 2011-06-24 01:14:23,443 WARN  parse.html - at 
 java.util.concurrent.FutureTask.run(FutureTask.java:138)
 2011-06-24 01:14:23,443 WARN  parse.html - at 
 java.lang.Thread.run(Thread.java:662)
 2011-06-24 01:14:23,443 WARN  parse.ParseSegment - Error parsing: 
 http://host/: failed(2,200): 
 java.nio.charset.IllegalCharsetNameException: $charset
 {code}





[jira] [Created] (NUTCH-1014) Migrate from Apache ORO to java.util.regex

2011-06-24 Thread Markus Jelsma (JIRA)
Migrate from Apache ORO to java.util.regex
--

 Key: NUTCH-1014
 URL: https://issues.apache.org/jira/browse/NUTCH-1014
 Project: Nutch
  Issue Type: Improvement
Reporter: Markus Jelsma
 Fix For: 1.4, 2.0


A separate issue tracking migration of all components from Apache ORO to 
java.util.regex. Components involved are:
- RegexURLNormalizer
- OutlinkExtractor
- JSParseFilter
- MoreIndexingFilter
- BasicURLNormalizer





[Nutch Wiki] Trivial Update of RunningNutchAndSolr by LewisJohnMcgibbney

2011-06-24 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The RunningNutchAndSolr page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/RunningNutchAndSolr?action=diff&rev1=45&rev2=46

  = New in Nutch 1.3 =
- Please note that in the nightly version of Apache Nutch there is now a Solr 
integration embedded so you can start to use a lot easier. Just download a 
nightly version from http://hudson.zones.apache.org/hudson/job/Nutch-trunk/.
+ Please note that Apache Nutch release 1.3 now has Solr integration embedded, 
this greatly eases Nutch-Solr integration. Just download release 1.3 from 
[[http://www.apache.org/dyn/closer.cgi/nutch/|here]].
  
  = Pre Solr Nutch integration =
  This is just a quick first pass at a guide for getting Nutch running with 
Solr.  I'm sure there are better ways of doing some/all of it, but I'm not 
aware of them.  By all means, please do correct/update this if someone has a 
better idea.  Many thanks to http://variogram.com and http://blog.foofactory.fi 
for all the help!  You guys saved me a lot of time! :)


[Nutch Wiki] Trivial Update of RunningNutchAndSolr by LewisJohnMcgibbney

2011-06-24 Thread Apache Wiki

The RunningNutchAndSolr page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/RunningNutchAndSolr?action=diff&rev1=46&rev2=47

+ This tutorial was originally constructed and posted by 'waycool' on the user 
lists. It has been edited slightly for integration into the Apache Nutch 
project.
- = New in Nutch 1.3 =
- Please note that Apache Nutch release 1.3 now has Solr integration embedded, 
this greatly eases Nutch-Solr integration. Just download release 1.3 from 
[[http://www.apache.org/dyn/closer.cgi/nutch/|here]].
  
+ = Notes about Nutch 1.3 =
+ Please note that Apache Nutch release 1.3 has Solr integration embedded, this 
greatly eases Nutch-Solr integration. Just download release 1.3 from 
[[http://www.apache.org/dyn/closer.cgi/nutch/|here]]. This also removes the 
legacy dependence upon both Apache Tomcat for running the old Nutch WebApp and 
upon Lucene for indexing
- = Pre Solr Nutch integration =
- This is just a quick first pass at a guide for getting Nutch running with 
Solr.  I'm sure there are better ways of doing some/all of it, but I'm not 
aware of them.  By all means, please do correct/update this if someone has a 
better idea.  Many thanks to http://variogram.com and http://blog.foofactory.fi 
for all the help!  You guys saved me a lot of time! :)
- 
- I'm posting it under Nutch rather than Solr on the presumption that people 
are more likely to be learning/using Solr first, then come here looking to 
combine it with Nutch.  I'm going to skip over doing command by command for 
right now.  I'm running/building on Ubuntu 7.10 using Java 1.6.0_05.  I'm 
assuming that the Solr trunk code is checked out into solr-trunk and Nutch 
trunk code is checked out into nutch-trunk.
- 
- 
- == Prerequisites ==
-  * apt-get install sun-java6-jdk subversion ant patch unzip
  
  == Ubuntu Note ==
  


[Nutch Wiki] Trivial Update of FrontPage by LewisJohnMcgibbney

2011-06-24 Thread Apache Wiki

The FrontPage page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/FrontPage?action=diff&rev1=189&rev2=190

   * Current CommandLineOptions /!\ :TODO:Missing pages to be added to 
accommodate new commands in Nutch 1.3 release also available content for 
existing commands to be updated to include new parameters  /!\ 
   * [[http://nutch.apache.org/apidocs-1.3/index.html|JavaDocs]] -- The 
!JavaDocs for Nutch-1.3 release.
  === Tutorials ===
+  * Running Nutch 1.3 with Solr Integration 
-  * RunningNutchAndSolr - How to configure Nutch 1.3 to crawl and post to 
Apache Solr for search/index /!\ :TODO:This tutorial is being updated to 
accommodate changes to Nutch 1.3 release /!\ 
+  - How to configure Nutch 1.3 to crawl and post to Apache Solr for 
search/index /!\ :TODO:This tutorial is being updated to accommodate changes to 
Nutch 1.3 release /!\ 
  === Configuration ===
   * OverviewDeploymentConfigs
   * NutchConfigurationFiles


[Nutch Wiki] Trivial Update of FrontPage by LewisJohnMcgibbney

2011-06-24 Thread Apache Wiki

The FrontPage page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/FrontPage?action=diff&rev1=192&rev2=193

   * Current CommandLineOptions /!\ :TODO:Missing pages to be added to 
accommodate new commands in Nutch 1.3 release also available content for 
existing commands to be updated to include new parameters  /!\ 
   * [[http://nutch.apache.org/apidocs-1.3/index.html|JavaDocs]] -- The 
!JavaDocs for Nutch-1.3 release.
  === Tutorials ===
-  * Nutch1.3WithSolrIntegration - How to configure Nutch 1.3 to crawl and post 
to Apache Solr for search/index /!\ :TODO:This tutorial is being updated to 
accommodate changes to Nutch 1.3 release /!\ 
+  * RunningNutchAndSolr - How to configure Nutch 1.3 to crawl and post to 
Apache Solr for search/index /!\ :TODO:This tutorial is being updated to 
accommodate changes to Nutch 1.3 release /!\ 
  === Configuration ===
   * OverviewDeploymentConfigs
   * NutchConfigurationFiles


[Nutch Wiki] Trivial Update of RunningNutchAndSolr by LewisJohnMcgibbney

2011-06-24 Thread Apache Wiki

The RunningNutchAndSolr page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/RunningNutchAndSolr?action=diff&rev1=50&rev2=51

  ## page was renamed from Nutch1.3WithSolrIntegration
  ## page was renamed from Running Nutch 1.3 with Solr Integration
  ## page was renamed from RunningNutchAndSolr
+ ## Lang: En
+ =RunningNutchAndSolr=
+ 
  This tutorial was originally constructed and posted by 'waycool' on the user 
lists. It has been edited slightly for integration into the Apache Nutch 
project.
  
+ Apache Nutch is an open source web crawler written in Java. By using it, we 
can find hyperlinks in an automated manner, reduce a lot of maintenance 
work (for example, checking broken links), and create a copy of all the visited 
pages for later searching. That’s where Apache Solr comes in. Solr is an open 
source full-text search framework; with Solr we can search the pages visited 
by Nutch. Luckily, integration between Nutch and Solr is pretty 
straightforward, as explained below.
- = Notes about Nutch 1.3 =
- Please note that Apache Nutch release 1.3 has Solr integration embedded, this 
greatly eases Nutch-Solr integration. Just download release 1.3 from 
[[http://www.apache.org/dyn/closer.cgi/nutch/|here]]. This also removes the 
legacy dependence upon both Apache Tomcat for running the old Nutch WebApp and 
upon Lucene for indexing
  
+ Apache Nutch release 1.3 has Solr integration embedded, this greatly eases 
Nutch-Solr integration. It also removes the legacy dependence upon both Apache 
Tomcat for running the old Nutch Web Application and upon Apache Lucene for 
indexing. Just download a 1.3 release from 
[[http://www.apache.org/dyn/closer.cgi/nutch/|here]]. NOTE: You can download 
release 1.3 in either binary or source format, both of which are covered in 
this tutorial.
+  
- == Ubuntu Note ==
- 
- If you are using more recent versions of Ubuntu Solr comes as a package 
installable through apt-get 
- 
- {{{
- sudo apt-get install solr-tomcat
- }}}
- 
- A more in-depth howto for Ubuntu Server 10.04 Lucid Lynx is available here: 
http://ubuntuforums.org/showthread.php?p=9596257
- 
- You might wish to install it that way instead of as follows. If so then you 
will find the solr config in /etc/solr/conf 
- and the web interface can be found at http://localhost:8080/solr/
- 
  == Steps ==
- The first step to get started is to download the required software 
components, namely Apache Solr and Nutch.
- 
- '''1.''' Download Solr version 1.3.0 or LucidWorks for Solr from Download page
- 
- '''2.''' Extract Solr package
+ Setup Nutch from binary distribution:
+ '''1a.''' Unzip your binary Nutch package to $HOME/nutch-1.3
+   cd $HOME/nutch-1.3/runtime/local 
+ Setup Nutch from source distribution:
+ '''1b.''' Unzip your source package to $HOME/nutch-1.3-src 
+   cd $HOME/nutch-1.3-src 
+   run “ant” command. 
+   It should generate a directory called $HOME/nutch-1.3-src/runtime. 
+   cd $HOME/nutch-1.3-src/runtime/local 
  
  '''3.''' Download Nutch version 1.0 or later (Alternatively download the 
nightly version of Nutch that contains the required functionality)
  


[Nutch Wiki] Trivial Update of RunningNutchAndSolr by LewisJohnMcgibbney

2011-06-24 Thread Apache Wiki

The RunningNutchAndSolr page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/RunningNutchAndSolr?action=diff&rev1=51&rev2=52

  ## page was renamed from Running Nutch 1.3 with Solr Integration
  ## page was renamed from RunningNutchAndSolr
  ## Lang: En
- =RunningNutchAndSolr=
- 
  This tutorial was originally constructed and posted by 'waycool' on the user 
lists. It has been edited slightly for integration into the Apache Nutch 
project.
  
  Apache Nutch is an open source web crawler written in Java. By using it, we 
can find hyperlinks in an automated manner, reduce a lot of maintenance 
work (for example, checking broken links), and create a copy of all the visited 
pages for later searching. That’s where Apache Solr comes in. Solr is an open 
source full-text search framework; with Solr we can search the pages visited 
by Nutch. Luckily, integration between Nutch and Solr is pretty 
straightforward, as explained below.
@@ -13, +11 @@

  Apache Nutch release 1.3 has Solr integration embedded, this greatly eases 
Nutch-Solr integration. It also removes the legacy dependence upon both Apache 
Tomcat for running the old Nutch Web Application and upon Apache Lucene for 
indexing. Just download a 1.3 release from 
[[http://www.apache.org/dyn/closer.cgi/nutch/|here]]. NOTE: You can download 
release 1.3 in either binary or source format, both of which are covered in 
this tutorial.
   
  == Steps ==
- Setup Nutch from binary distribution:
+ '''1a.''' Setup Nutch from binary distribution:
- '''1a.''' Unzip your binary Nutch package to $HOME/nutch-1.3
+   i.  Unzip your binary Nutch package to $HOME/nutch-1.3
-   cd $HOME/nutch-1.3/runtime/local 
+   ii. cd $HOME/nutch-1.3/runtime/local 
- Setup Nutch from source distribution:
+ '''1b.''' Setup Nutch from source distribution:
- '''1b.''' Unzip your source package to $HOME/nutch-1.3-src 
+   i.   Unzip your source package to $HOME/nutch-1.3-src 
-   cd $HOME/nutch-1.3-src 
+   ii.  cd $HOME/nutch-1.3-src 
-   run “ant” command. 
+   iii. run “ant” command. 
-   It should generate a directory called $HOME/nutch-1.3-src/runtime. 
+   iv.  It should generate a directory called 
$HOME/nutch-1.3-src/runtime. 
-   cd $HOME/nutch-1.3-src/runtime/local 
+   v.   cd $HOME/nutch-1.3-src/runtime/local 
+ 
+ From now on, we are going to use ${NUTCH_RUNTIME_HOME} to refer to the current 
directory.
  
  '''3.''' Download Nutch version 1.0 or later (Alternatively download the 
nightly version of Nutch that contains the required functionality)
  


[Nutch Wiki] Trivial Update of RunningNutchAndSolr by LewisJohnMcgibbney

2011-06-24 Thread Apache Wiki

The RunningNutchAndSolr page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/RunningNutchAndSolr?action=diff&rev1=55&rev2=56

  Usage: nutch [-core] COMMAND
  }}}
  
-  Some troubleshooting tips:
+ Some troubleshooting tips:
   * Run the following command if you are seeing Permission denied:
  {{{
  chmod +x bin/nutch
  }}}
   * Setup JAVA_HOME if you are seeing JAVA_HOME not set. On Mac, you can run 
the following command or add it to ~/.bashrc:
+ {{{
  export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home
+ }}}
  
  '''4.''' Extract the Nutch package   tar xzf apache-nutch-1.0.tar.gz
  


[Nutch Wiki] Trivial Update of RunningNutchAndSolr by LewisJohnMcgibbney

2011-06-24 Thread Apache Wiki

The RunningNutchAndSolr page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/RunningNutchAndSolr?action=diff&rev1=58&rev2=59

Comment:
This revision is a first attempt at getting a local

  
  '''4a.''' Setup Solr for search from binary distribution:
   * download binary file from 
[[http://www.apache.org/dyn/closer.cgi/lucene/solr/|here]]
-  * unzip to $HOME/apache-solr-3.X
-  * cd apache-solr-3.X/example
+  * unzip to $HOME/apache-solr-3.X, we will now refer to this as 
${APACHE_SOLR_HOME}
+  * cd ${APACHE_SOLR_HOME}/example
   * java -jar start.jar
  
  '''4b.''' Setup Solr for search from source distribution:
   * You can setup Solr from source distribution with Maven. This 
[[http://thetechietutorials.blogspot.com/2011/06/how-to-build-and-start-apache-solr.html|link]]
 shows how to do that.
  
+ '''5.''' Verify Solr installation:
+ After you have started Solr, you should be able to access the following 
admin console links:
- 
- '''a.''' Copy the provided Nutch schema from directory apache-nutch-1.0/conf 
to directory apache-solr-1.3.0/example/solr/conf (override the existing file)
- 
- We want to allow Solr to create the snippets for search results so we need to 
store the content in addition to indexing it:
- 
- '''b.''' Change schema.xml so that the stored attribute of field “content” is 
true.
- 
  {{{
- <field name="content" type="text" stored="true" indexed="true"/>
+ http://localhost:8983/solr/admin/
+ http://localhost:8983/solr/admin/stats.jsp
  }}}
  
+ '''6.''' Integrate Solr with Nutch
+ We now have both Nutch and Solr installed and set up correctly, and Nutch has 
already created crawl data from the seed URL(s). Below are the steps to delegate 
searching to Solr so that links become searchable:
+  * cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml 
${APACHE_SOLR_HOME}/example/solr/conf/ 
+  * restart Solr with the command “java -jar start.jar” under 
${APACHE_SOLR_HOME}/example 
+  * run the Solr Index command:
- We want to be able to tweak the relevancy of queries easily so we’ll create 
new [[http://wiki.apache.org/solr/DisMaxRequestHandler|dismax request handler]] 
configuration for our use case:
- 
- '''d.''' Open apache-solr-1.3.0/example/solr/conf/solrconfig.xml and paste 
following fragment to it
- 
- {{{
- <requestHandler name="/nutch" class="solr.SearchHandler">
- <lst name="defaults">
- <str name="defType">dismax</str>
- <str name="echoParams">explicit</str>
- <float name="tie">0.01</float>
- <str name="qf">
- content^0.5 anchor^1.0 title^1.2 </str>
- <str name="pf"> content^0.5 anchor^1.5 title^1.2 site^1.5 </str>
- <str name="fl"> url </str>
- <str name="mm"> 2&lt;-1 5&lt;-2 6&lt;90% </str>
- <int name="ps">100</int>
- <bool name="hl">true</bool>
- <str name="q.alt">*:*</str>
- <str name="hl.fl">title url content</str>
- <str name="f.title.hl.fragsize">0</str>
- <str name="f.title.hl.alternateField">title</str>
- <str name="f.url.hl.fragsize">0</str>
- <str name="f.url.hl.alternateField">url</str>
- <str name="f.content.hl.fragmenter">regex</str>
- </lst>
- </requestHandler>
- }}}
- 
- 
- '''6.''' Start Solr
- 
- Assuming you have installed Solr as per instructions above. 
- {{{
- cd apache-solr-1.3.0/example java -jar start.jar
- }}}
- 
- 
- 
- '''7'''. Configure Nutch
- 
- a. Open nutch-site.xml in directory apache-nutch-1.0/conf, replace its 
contents with the following (we specify our crawler name, active plugins and 
limit maximum url count for single host per run to be 100) :
- 
- {{{
- <?xml version="1.0"?>
- <configuration>
- <property>
- <name>http.agent.name</name>
- <value>nutch-solr-integration</value>
- </property>
- <property>
- <name>generate.max.per.host</name>
- <value>100</value>
- </property>
- <property>
- <name>plugin.includes</name>
- <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
- </property>
- </configuration>
- }}}
- 
- 
- '''b.''' Open regex-urlfilter.txt in directory apache-nutch-1.0/conf, replace 
its content with something similar to the following:
- 
- {{{
- -^(https|telnet|file|ftp|mailto):
- # skip some suffixes 
- -\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
- # skip URLs containing certain characters as probable queries, etc. 
- -[?*!@=]
- # allow urls in foofactory.fi domain (or lucidimagination.com...)
- +^http://([a-z0-9\-A-Z]*\.)*lucidimagination.com/
- # deny anything else 
- -.
- }}}
- 
- 
- '''8.''' Create a seed list (the initial urls to fetch)
- 
- {{{
- mkdir urls 
- echo "http://www.lucidimagination.com/" > urls/seed.txt
- }}}
- 
- '''9.''' Inject seed url(s) to nutch crawldb (execute in nutch directory)
- 
- {{{
- bin/nutch inject crawl/crawldb urls
- }}}
- 
- '''10.''' Generate fetch list, fetch and parse content
- 
- {{{
- bin/nutch generate 

[jira] [Updated] (NUTCH-987) Support HTTP auth for Solr communication

2011-06-24 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-987:


Fix Version/s: 1.4

 Support HTTP auth for Solr communication
 

 Key: NUTCH-987
 URL: https://issues.apache.org/jira/browse/NUTCH-987
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.4, 2.0

 Attachments: NUTCH-987-1.3-hack.patch


 At the moment we cannot send data directly to a public HTTP auth protected 
 Solr instance. I've a WIP that passes a configured HTTPClient object to 
 CommonsHttpSolrServer, it works. This issue should add this ability to 
 indexing, dedup and clean and be configured from some configuration file.
 The question is, is the current httpclient-auth.xml the correct place? It 
 does provide a nice means to configure the AuthScope objects but it is used 
 for fetching. But, since AuthScope is used we could easily add the 
 credentials for Solr there as well and add a new nutch-default option for 
 toggling HTTP auth.
 Thoughts?
