[jira] [Created] (NUTCH-1013) Migrate RegexURLNormalizer from Apache ORO to java.util.regex
Migrate RegexURLNormalizer from Apache ORO to java.util.regex

Key: NUTCH-1013
URL: https://issues.apache.org/jira/browse/NUTCH-1013
Project: Nutch
Issue Type: Improvement
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 1.4, 2.0

Apache ORO uses old Perl 5-style regular expressions. Features such as the powerful lookbehind are not available. The project has become retired as well.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1013) Migrate RegexURLNormalizer from Apache ORO to java.util.regex
[ https://issues.apache.org/jira/browse/NUTCH-1013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1013:

Attachment: NUTCH-1013-1.4.patch

Patch for RegexURLNormalizer for 1.4. Seems to work fine.

Migrate RegexURLNormalizer from Apache ORO to java.util.regex

Key: NUTCH-1013
URL: https://issues.apache.org/jira/browse/NUTCH-1013
Project: Nutch
Issue Type: Improvement
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 1.4, 2.0
Attachments: NUTCH-1013-1.4.patch

Apache ORO uses old Perl 5-style regular expressions. Features such as the powerful lookbehind are not available. The project has become retired as well.
[jira] [Commented] (NUTCH-1011) Normalize duplicate slashes in URL's
[ https://issues.apache.org/jira/browse/NUTCH-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054438#comment-13054438 ]

Markus Jelsma commented on NUTCH-1011:

This normalizer works with NUTCH-1013.

{code}
<!-- removes duplicate slashes -->
<regex>
  <pattern>(?&lt;!:)/{2,}</pattern>
  <substitution>/</substitution>
</regex>
{code}

Normalize duplicate slashes in URL's

Key: NUTCH-1011
URL: https://issues.apache.org/jira/browse/NUTCH-1011
Project: Nutch
Issue Type: Improvement
Affects Versions: 1.4, 2.0
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
Attachments: NUTCH-1011-all-3.patch

Many websites produce faulty URLs with multiple slashes, e.g. http://cocoon.apache.org///1.x/dynamic.html. This can be really nasty if the number of slashes varies, resulting in many URLs actually pointing to the same page and generating new (unique) URLs to the same or other duplicate pages.
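For readers who want to try the rule from the comment above outside Nutch, here is a minimal sketch using only java.util.regex. The class and method names are made up for illustration and are not taken from the NUTCH-1011 or NUTCH-1013 patches. The point it shows is that the negative lookbehind `(?<!:)` leaves the `//` after the scheme alone while collapsing every other run of slashes.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: applies the duplicate-slash rule
// quoted in the comment above with plain java.util.regex.
final class DuplicateSlashSketch {

    // (?<!:) is a negative lookbehind: a run of two or more slashes only
    // matches when it does not start directly after a colon.
    private static final Pattern DUPLICATE_SLASHES = Pattern.compile("(?<!:)/{2,}");

    static String normalize(String url) {
        return DUPLICATE_SLASHES.matcher(url).replaceAll("/");
    }

    public static void main(String[] args) {
        // prints http://cocoon.apache.org/1.x/dynamic.html
        System.out.println(normalize("http://cocoon.apache.org///1.x/dynamic.html"));
    }
}
```

The pattern is compiled once and reused, since a compiled `Pattern` is immutable and safe to share between threads.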
[jira] [Issue Comment Edited] (NUTCH-1013) Migrate RegexURLNormalizer from Apache ORO to java.util.regex
[ https://issues.apache.org/jira/browse/NUTCH-1013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054437#comment-13054437 ]

Markus Jelsma edited comment on NUTCH-1013 at 6/24/11 1:30 PM:

Patch for RegexURLNormalizer for 1.4. Seems to work fine. It also compiles against trunk. Unit tests pass as well. Are there objections? Things to take special care of?

was (Author: markus17):
Patch for RegexURLNormalizer for 1.4. Seems to work fine. It also compiles against trunk.

Migrate RegexURLNormalizer from Apache ORO to java.util.regex

Key: NUTCH-1013
URL: https://issues.apache.org/jira/browse/NUTCH-1013
Project: Nutch
Issue Type: Improvement
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 1.4, 2.0
Attachments: NUTCH-1013-1.4.patch

Apache ORO uses old Perl 5-style regular expressions. Features such as the powerful lookbehind are not available. The project has become retired as well.
[jira] [Commented] (NUTCH-1013) Migrate RegexURLNormalizer from Apache ORO to java.util.regex
[ https://issues.apache.org/jira/browse/NUTCH-1013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054471#comment-13054471 ]

Ken Krugler commented on NUTCH-1013:

No comment directly related to this patch, but URL normalization seems like a great component to move into crawler-commons, since all web crawlers need to do the same thing.

Migrate RegexURLNormalizer from Apache ORO to java.util.regex

Key: NUTCH-1013
URL: https://issues.apache.org/jira/browse/NUTCH-1013
Project: Nutch
Issue Type: Improvement
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 1.4, 2.0
Attachments: NUTCH-1013-1.4.patch

Apache ORO uses old Perl 5-style regular expressions. Features such as the powerful lookbehind are not available. The project has become retired as well.
[jira] [Commented] (NUTCH-1012) Cannot handle illegal charset $charset
[ https://issues.apache.org/jira/browse/NUTCH-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054475#comment-13054475 ]

Ken Krugler commented on NUTCH-1012:

Tika has code to try to resolve charset names (and handle common error cases) in a graceful manner. Nutch might want to use this code, or we could add a general wrapper to crawler-commons. See CharsetUtils in Tika.

Cannot handle illegal charset $charset

Key: NUTCH-1012
URL: https://issues.apache.org/jira/browse/NUTCH-1012
Project: Nutch
Issue Type: Bug
Components: parser
Affects Versions: 1.3
Reporter: Markus Jelsma
Priority: Minor
Fix For: 1.4

Pages returning:

{code}
Content-Type: text/html; charset=$charset
{code}

cause:

{code}
Error parsing: http://host/: failed(2,200): java.nio.charset.IllegalCharsetNameException: $charset
Found a TextHeaderAtom not followed by a TextBytesAtom or TextCharsAtom: Followed by 3999
ParseSegment: finished at 2011-06-24 01:14:54, elapsed: 00:01:12
{code}

Stack trace:

{code}
2011-06-24 01:14:23,442 WARN parse.html - java.nio.charset.IllegalCharsetNameException: $charset
2011-06-24 01:14:23,442 WARN parse.html - at java.nio.charset.Charset.checkName(Charset.java:284)
2011-06-24 01:14:23,442 WARN parse.html - at java.nio.charset.Charset.lookup2(Charset.java:458)
2011-06-24 01:14:23,442 WARN parse.html - at java.nio.charset.Charset.lookup(Charset.java:437)
2011-06-24 01:14:23,442 WARN parse.html - at java.nio.charset.Charset.isSupported(Charset.java:479)
2011-06-24 01:14:23,442 WARN parse.html - at org.apache.nutch.util.EncodingDetector.resolveEncodingAlias(EncodingDetector.java:310)
2011-06-24 01:14:23,442 WARN parse.html - at org.apache.nutch.util.EncodingDetector.addClue(EncodingDetector.java:201)
2011-06-24 01:14:23,442 WARN parse.html - at org.apache.nutch.util.EncodingDetector.addClue(EncodingDetector.java:208)
2011-06-24 01:14:23,442 WARN parse.html - at org.apache.nutch.util.EncodingDetector.autoDetectClues(EncodingDetector.java:193)
2011-06-24 01:14:23,442 WARN parse.html - at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:138)
2011-06-24 01:14:23,442 WARN parse.html - at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
2011-06-24 01:14:23,443 WARN parse.html - at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
2011-06-24 01:14:23,443 WARN parse.html - at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
2011-06-24 01:14:23,443 WARN parse.html - at java.util.concurrent.FutureTask.run(FutureTask.java:138)
2011-06-24 01:14:23,443 WARN parse.html - at java.lang.Thread.run(Thread.java:662)
2011-06-24 01:14:23,443 WARN parse.ParseSegment - Error parsing: http://host/: failed(2,200): java.nio.charset.IllegalCharsetNameException: $charset
{code}
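The trace above shows `Charset.isSupported` itself throwing for a syntactically illegal name such as `$charset`, so a caller has to guard the lookup. A minimal JDK-only sketch of such a guard follows. It is not Tika's CharsetUtils and not the eventual Nutch fix; the class name, method name and fallback behaviour are assumptions made for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Hypothetical guard, not Nutch or Tika code: resolves a charset name taken
// from an HTTP header or meta tag without letting an illegal name escape.
final class SafeCharsetSketch {

    static String resolveOrDefault(String name, String fallback) {
        if (name == null) {
            return fallback; // Charset.isSupported(null) would throw IllegalArgumentException
        }
        try {
            String trimmed = name.trim();
            // isSupported returns false for legal but unknown names and
            // throws for illegal ones such as "$charset".
            return Charset.isSupported(trimmed) ? Charset.forName(trimmed).name() : fallback;
        } catch (IllegalCharsetNameException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(resolveOrDefault("$charset", "UTF-8"));
        System.out.println(resolveOrDefault("utf-8", "ISO-8859-1"));
    }
}
```

Whether to fall back silently or to drop the clue altogether is a design choice for the encoding detector. The sketch only shows that the exception can be contained at the lookup.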
[jira] [Created] (NUTCH-1014) Migrate from Apache ORO to java.util.regex
Migrate from Apache ORO to java.util.regex

Key: NUTCH-1014
URL: https://issues.apache.org/jira/browse/NUTCH-1014
Project: Nutch
Issue Type: Improvement
Reporter: Markus Jelsma
Fix For: 1.4, 2.0

A separate issue tracking migration of all components from Apache ORO to java.util.regex. Components involved are:
- RegexURLNormalizer
- OutlinkExtractor
- JSParseFilter
- MoreIndexingFilter
- BasicURLNormalizer
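As a rough picture of what the java.util.regex side of such a migration can look like for a normalizer-style component, here is a small sketch of an ordered list of pattern/substitution rules. It is not taken from any of the attached patches. The class name and both sample rules (session-id removal and default-port removal) are hypothetical and serve only to show precompiled patterns and `$1` group references.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical rule chain, not Nutch code: each rule is compiled once and
// applied in order, feeding its output to the next rule.
final class RegexRuleChainSketch {

    private static final class Rule {
        final Pattern pattern;
        final String substitution;

        Rule(String regex, String substitution) {
            this.pattern = Pattern.compile(regex);
            this.substitution = substitution;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    RegexRuleChainSketch add(String regex, String substitution) {
        rules.add(new Rule(regex, substitution));
        return this;
    }

    String normalize(String url) {
        String result = url;
        for (Rule rule : rules) {
            // replaceAll understands $1, $2, ... as group references.
            result = rule.pattern.matcher(result).replaceAll(rule.substitution);
        }
        return result;
    }

    public static void main(String[] args) {
        RegexRuleChainSketch chain = new RegexRuleChainSketch()
                .add(";jsessionid=[^?#]*", "")             // sample rule: drop session ids
                .add("^(http://[^/:]+):80(?=/|$)", "$1");  // sample rule: drop default port
        System.out.println(chain.normalize("http://example.com:80/a;jsessionid=ABC123?x=1"));
    }
}
```

A fresh `Matcher` is created per call because matchers, unlike compiled patterns, are not thread-safe.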
[Nutch Wiki] Trivial Update of RunningNutchAndSolr by LewisJohnMcgibbney
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification.

The RunningNutchAndSolr page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/RunningNutchAndSolr?action=diff&rev1=45&rev2=46

  = New in Nutch 1.3 =
- Please note that in the nightly version of Apache Nutch there is now a Solr integration embedded so you can start to use a lot easier. Just download a nightly version from http://hudson.zones.apache.org/hudson/job/Nutch-trunk/.
+ Please note that Apache Nutch release 1.3 now has Solr integration embedded, this greatly eases Nutch-Solr integration. Just download release 1.3 from [[http://www.apache.org/dyn/closer.cgi/nutch/|here]].
  = Pre Solr Nutch integration =
  This is just a quick first pass at a guide for getting Nutch running with Solr. I'm sure there are better ways of doing some/all of it, but I'm not aware of them. By all means, please do correct/update this if someone has a better idea. Many thanks to http://variogram.com and http://blog.foofactory.fi for all the help! You guys saved me a lot of time! :)
[Nutch Wiki] Trivial Update of RunningNutchAndSolr by LewisJohnMcgibbney
The RunningNutchAndSolr page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/RunningNutchAndSolr?action=diff&rev1=46&rev2=47

+ This tutorial was originally constructed and posted by 'waycool' on the user lists. It has been edited slightly for integration into the Apache Nutch project.
- = New in Nutch 1.3 =
- Please note that Apache Nutch release 1.3 now has Solr integration embedded, this greatly eases Nutch-Solr integration. Just download release 1.3 from [[http://www.apache.org/dyn/closer.cgi/nutch/|here]].
+ = Notes about Nutch 1.3 =
+ Please note that Apache Nutch release 1.3 has Solr integration embedded, this greatly eases Nutch-Solr integration. Just download release 1.3 from [[http://www.apache.org/dyn/closer.cgi/nutch/|here]]. This also removes the legacy dependence upon both Apache Tomcat for running the old Nutch WebApp and upon Lucene for indexing
- = Pre Solr Nutch integration =
- This is just a quick first pass at a guide for getting Nutch running with Solr. I'm sure there are better ways of doing some/all of it, but I'm not aware of them. By all means, please do correct/update this if someone has a better idea. Many thanks to http://variogram.com and http://blog.foofactory.fi for all the help! You guys saved me a lot of time! :)
-
- I'm posting it under Nutch rather than Solr on the presumption that people are more likely to be learning/using Solr first, then come here looking to combine it with Nutch. I'm going to skip over doing command by command for right now. I'm running/building on Ubuntu 7.10 using Java 1.6.0_05. I'm assuming that the Solr trunk code is checked out into solr-trunk and Nutch trunk code is checked out into nutch-trunk.
-
-
- == Prerequisites ==
- * apt-get install sun-java6-jdk subversion ant patch unzip
  == Ubuntu Note ==
[Nutch Wiki] Trivial Update of FrontPage by LewisJohnMcgibbney
The FrontPage page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/FrontPage?action=diff&rev1=189&rev2=190

  * Current CommandLineOptions /!\ :TODO:Missing pages to be added to accommodate new commands in Nutch 1.3 release also available content for existing commands to be updated to include new parameters /!\
  * [[http://nutch.apache.org/apidocs-1.3/index.html|JavaDocs]] -- The !JavaDocs for Nutch-1.3 release.
  === Tutorials ===
+ * Running Nutch 1.3 with Solr Integration
- * RunningNutchAndSolr - How to configure Nutch 1.3 to crawl and post to Apache Solr for search/index /!\ :TODO:This tutorial is being updated to accommodate changes to Nutch 1.3 release /!\
+ - How to configure Nutch 1.3 to crawl and post to Apache Solr for search/index /!\ :TODO:This tutorial is being updated to accommodate changes to Nutch 1.3 release /!\
  === Configuration ===
  * OverviewDeploymentConfigs
  * NutchConfigurationFiles
[Nutch Wiki] Trivial Update of FrontPage by LewisJohnMcgibbney
The FrontPage page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/FrontPage?action=diff&rev1=192&rev2=193

  * Current CommandLineOptions /!\ :TODO:Missing pages to be added to accommodate new commands in Nutch 1.3 release also available content for existing commands to be updated to include new parameters /!\
  * [[http://nutch.apache.org/apidocs-1.3/index.html|JavaDocs]] -- The !JavaDocs for Nutch-1.3 release.
  === Tutorials ===
- * Nutch1.3WithSolrIntegration - How to configure Nutch 1.3 to crawl and post to Apache Solr for search/index /!\ :TODO:This tutorial is being updated to accommodate changes to Nutch 1.3 release /!\
+ * RunningNutchAndSolr - How to configure Nutch 1.3 to crawl and post to Apache Solr for search/index /!\ :TODO:This tutorial is being updated to accommodate changes to Nutch 1.3 release /!\
  === Configuration ===
  * OverviewDeploymentConfigs
  * NutchConfigurationFiles
[Nutch Wiki] Trivial Update of RunningNutchAndSolr by LewisJohnMcgibbney
The RunningNutchAndSolr page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/RunningNutchAndSolr?action=diff&rev1=50&rev2=51

  ## page was renamed from Nutch1.3WithSolrIntegration
  ## page was renamed from Running Nutch 1.3 with Solr Integration
  ## page was renamed from RunningNutchAndSolr
+ ## Lang: En
+ =RunningNutchAndSolr=
+ This tutorial was originally constructed and posted by 'waycool' on the user lists. It has been edited slightly for integration into the Apache Nutch project.
+ Apache Nutch is an open source web crawler written in Java. By using it, we can find out the hyperlinks in automated manner, reduce lots of maintenance work, for example checking broken links, and create a copy of all the visited pages for future search. That’s where Apache Solr comes in. Solr is an open source full text search framework, with Solr we can search the visited pages from Nutch. Luckily, integration between Nutch and Solr is pretty straightforward as explained below.
- = Notes about Nutch 1.3 =
- Please note that Apache Nutch release 1.3 has Solr integration embedded, this greatly eases Nutch-Solr integration. Just download release 1.3 from [[http://www.apache.org/dyn/closer.cgi/nutch/|here]]. This also removes the legacy dependence upon both Apache Tomcat for running the old Nutch WebApp and upon Lucene for indexing
+ Apache Nutch release 1.3 has Solr integration embedded, this greatly eases Nutch-Solr integration. It also removes the legacy dependence upon both Apache Tomcat for running the old Nutch Web Application and upon Apache Lucene for indexing. Just download a 1.3 release from [[http://www.apache.org/dyn/closer.cgi/nutch/|here]]. NOTE: You can download release 1.3 in either binary or source format, both of which are covered in this tutorial.
+
- == Ubuntu Note ==
-
- If you are using more recent versions of Ubuntu Solr comes as a package installable through apt-get
-
- {{{
- sudo apt-get install solr-tomcat
- }}}
-
- A more in-depth howto for Ubuntu Server 10.04 Lucid Lynx is available here: http://ubuntuforums.org/showthread.php?p=9596257
-
- You might wish to install it that way instead of as follows. If so then you will find the solr config in /etc/solr/conf
- and the web interface can be found at http://localhost:8080/solr/
-
  == Steps ==
- The first step to get started is to download the required software components, namely Apache Solr and Nutch.
-
- '''1.''' Download Solr version 1.3.0 or LucidWorks for Solr from Download page
-
- '''2.''' Extract Solr package
+ Setup Nutch from binary distribution:
+ '''1a.''' Unzip your binary Nutch package to $HOME/nutch-1.3
+ cd $HOME/nutch-1.3/runtime/local
+ Setup Nutch from source distribution:
+ '''1b.''' Unzip your source package to $HOME/nutch-1.3-src
+ cd $HOME/nutch-1.3-src
+ run “ant” command.
+ It should generate a directory called $HOME/nutch-1.3-src/runtime.
+ cd $HOME/nutch-1.3-src/runtime/local
  '''3.''' Download Nutch version 1.0 or later (Alternatively download the nightly version of Nutch that contains the required functionality)
[Nutch Wiki] Trivial Update of RunningNutchAndSolr by LewisJohnMcgibbney
The RunningNutchAndSolr page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/RunningNutchAndSolr?action=diff&rev1=51&rev2=52

  ## page was renamed from Running Nutch 1.3 with Solr Integration
  ## page was renamed from RunningNutchAndSolr
  ## Lang: En
- =RunningNutchAndSolr=
- This tutorial was originally constructed and posted by 'waycool' on the user lists. It has been edited slightly for integration into the Apache Nutch project.
  Apache Nutch is an open source web crawler written in Java. By using it, we can find out the hyperlinks in automated manner, reduce lots of maintenance work, for example checking broken links, and create a copy of all the visited pages for future search. That’s where Apache Solr comes in. Solr is an open source full text search framework, with Solr we can search the visited pages from Nutch. Luckily, integration between Nutch and Solr is pretty straightforward as explained below.
@@ -13, +11 @@
  Apache Nutch release 1.3 has Solr integration embedded, this greatly eases Nutch-Solr integration. It also removes the legacy dependence upon both Apache Tomcat for running the old Nutch Web Application and upon Apache Lucene for indexing. Just download a 1.3 release from [[http://www.apache.org/dyn/closer.cgi/nutch/|here]]. NOTE: You can download release 1.3 in either binary or source format, both of which are covered in this tutorial.
  == Steps ==
- Setup Nutch from binary distribution:
+ '''1a.''' Setup Nutch from binary distribution:
- '''1a.''' Unzip your binary Nutch package to $HOME/nutch-1.3
+ i. Unzip your binary Nutch package to $HOME/nutch-1.3
- cd $HOME/nutch-1.3/runtime/local
+ ii. cd $HOME/nutch-1.3/runtime/local
- Setup Nutch from source distribution:
+ '''1b.''' Setup Nutch from source distribution:
- '''1b.''' Unzip your source package to $HOME/nutch-1.3-src
+ i. Unzip your source package to $HOME/nutch-1.3-src
- cd $HOME/nutch-1.3-src
+ ii. cd $HOME/nutch-1.3-src
- run “ant” command.
+ iii. run “ant” command.
- It should generate a directory called $HOME/nutch-1.3-src/runtime.
+ iv. It should generate a directory called $HOME/nutch-1.3-src/runtime.
- cd $HOME/nutch-1.3-src/runtime/local
+ v. cd $HOME/nutch-1.3-src/runtime/local
+
+ From now on, we are going to use ${NUTCH_RUNTIME_HOME} to refer to the current directory.
  '''3.''' Download Nutch version 1.0 or later (Alternatively download the nightly version of Nutch that contains the required functionality)
[Nutch Wiki] Trivial Update of RunningNutchAndSolr by LewisJohnMcgibbney
The RunningNutchAndSolr page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/RunningNutchAndSolr?action=diff&rev1=55&rev2=56

  Usage: nutch [-core] COMMAND
  }}}
- Some troubleshooting tips:
+ Some troubleshooting tips:
  * Run the following command if you are seeing Permission denied:
  {{{
  chmod +x bin/nutch
  }}}
  * Setup JAVA_HOME if you are seeing JAVA_HOME not set. On Mac, you can run the following command or add it to ~/.bashrc:
+ {{{
  export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home
+ }}}
  '''4.''' Extract the Nutch package tar xzf apache-nutch-1.0.tar.gz
[Nutch Wiki] Trivial Update of RunningNutchAndSolr by LewisJohnMcgibbney
The RunningNutchAndSolr page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/RunningNutchAndSolr?action=diff&rev1=58&rev2=59

Comment:
This revision is a first attempt at getting a local

  '''4a.''' Setup Solr for search from binary distribution:
  * download binary file from [[http://www.apache.org/dyn/closer.cgi/lucene/solr/|here]]
- * unzip to $HOME/apache-solr-3.X
- * cd apache-solr-3.X/example
+ * unzip to $HOME/apache-solr-3.X, we will now refer to this as ${APACHE_SOLR_HOME}
+ * cd ${APACHE_SOLR_HOME}/example
  * java -jar start.jar
  '''4b.''' Setup Solr for search from source distribution:
  * You can setup Solr from source distribution with Maven. This [[http://thetechietutorials.blogspot.com/2011/06/how-to-build-and-start-apache-solr.html|link]] shows how to do that.
+ '''5.''' Verify Solr installation:
+ After you started Solr admin console, you should be able to access the following links:
-
- '''a.''' Copy the provided Nutch schema from directory apache-nutch-1.0/conf to directory apache-solr-1.3.0/example/solr/conf (override the existing file)
-
- We want to allow Solr to create the snippets for search results so we need to store the content in addition to indexing it:
-
- '''b.''' Change schema.xml so that the stored attribute of field “content” is true.
-
  {{{
- <field name=”content” type=”text” stored=”true” indexed=”true”/>
+ http://localhost:8983/solr/admin/
+ http://localhost:8983/solr/admin/stats.jsp
  }}}
+ '''6.''' Integrate Solr with Nutch
+ We have both Nutch and Solr installed and setup correctly. And Nutch already created crawl data from the seed url(s). Below are the steps to delegate searching to Solr for links to be searchable:
+ * cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/conf/
+ * restart Solr with the command “java -jar start.jar” under ${APACHE_SOLR_HOME}/example
+ * run the Solr Index command:
- We want to be able to tweak the relevancy of queries easily so we’ll create new [[http://wiki.apache.org/solr/DisMaxRequestHandler|dismax request handler]] configuration for our use case:
-
- '''d.''' Open apache-solr-1.3.0/example/solr/conf/solrconfig.xml and paste following fragment to it
-
- {{{
- <requestHandler name="/nutch" class="solr.SearchHandler">
- <lst name="defaults">
- <str name="defType">dismax</str>
- <str name="echoParams">explicit</str>
- <float name="tie">0.01</float>
- <str name="qf">
- content^0.5 anchor^1.0 title^1.2 </str>
- <str name="pf"> content^0.5 anchor^1.5 title^1.2 site^1.5 </str>
- <str name="fl"> url </str>
- <str name="mm"> 2&lt;-1 5&lt;-2 6&lt;90% </str>
- <int name="ps">100</int>
- <bool name="hl">true</bool>
- <str name="q.alt">*:*</str>
- <str name="hl.fl">title url content</str>
- <str name="f.title.hl.fragsize">0</str>
- <str name="f.title.hl.alternateField">title</str>
- <str name="f.url.hl.fragsize">0</str>
- <str name="f.url.hl.alternateField">url</str>
- <str name="f.content.hl.fragmenter">regex</str>
- </lst>
- </requestHandler>
- }}}
-
-
- '''6.''' Start Solr
-
- Assuming you have installed Solr as per instructions above.
- {{{
- cd apache-solr-1.3.0/example java -jar start.jar
- }}}
-
-
-
- '''7'''. Configure Nutch
-
- a. Open nutch-site.xml in directory apache-nutch-1.0/conf, replace its contents with the following (we specify our crawler name, active plugins and limit maximum url count for single host per run to be 100) :
-
- {{{
- <?xml version="1.0"?> <configuration>
- <property>
- <name>http.agent.name</name>
- <value>nutch-solr-integration</value>
- </property>
- <property> <name>generate.max.per.host</name>
- <value>100</value>
- </property>
- <property>
- <name>plugin.includes</name>
- <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
- </property>
- </configuration>
- }}}
-
-
- '''b.''' Open regex-urlfilter.txt in directory apache-nutch-1.0/conf, replace its content with something similar to the following:
-
- {{{
- -^(https|telnet|file|ftp|mailto):
- # skip some suffixes
- -\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
- # skip URLs containing certain characters as probable queries, etc.
- -[?*!@=]
- # allow urls in foofactory.fi domain (or lucidimagination.com...)
- +^http://([a-z0-9\-A-Z]*\.)*lucidimagination.com/
- # deny anything else
- -.
- }}}
-
-
- '''8.''' Create a seed list (the initial urls to fetch)
-
- {{{
- mkdir urls
- echo http://www.lucidimagination.com/ > urls/seed.txt
- }}}
-
- '''9.''' Inject seed url(s) to nutch crawldb (execute in nutch directory)
-
- {{{
- bin/nutch inject crawl/crawldb urls
- }}}
-
- '''10.''' Generate fetch list, fetch and parse content
-
- {{{
- bin/nutch generate
[jira] [Updated] (NUTCH-987) Support HTTP auth for Solr communication
[ https://issues.apache.org/jira/browse/NUTCH-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-987:

Fix Version/s: 1.4

Support HTTP auth for Solr communication

Key: NUTCH-987
URL: https://issues.apache.org/jira/browse/NUTCH-987
Project: Nutch
Issue Type: Improvement
Components: indexer
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 1.4, 2.0
Attachments: NUTCH-987-1.3-hack.patch

At the moment we cannot send data directly to a public HTTP auth protected Solr instance. I've a WIP that passes a configured HTTPClient object to CommonsHttpSolrServer, it works. This issue should add this ability to indexing, dedup and clean and be configured from some configuration file. The question is, is the current httpclient-auth.xml the correct place? It does provide a nice means to configure the AuthScope objects but it is used for fetching. But, since AuthScope is used we could easily add the credentials for Solr there as well and add a new nutch-default option for toggling HTTP auth. Thoughts?