[jira] Created: (NUTCH-891) Nutch build should not depend on unversioned local deps
Nutch build should not depend on unversioned local deps
---

Key: NUTCH-891
URL: https://issues.apache.org/jira/browse/NUTCH-891
Project: Nutch
Issue Type: Bug
Reporter: Andrzej Bialecki

The fix in NUTCH-873 introduces an unknown variable to the build process. Since local ivy artifacts are unversioned, different people that install Gora jars at different points in time will use the same artifact id but in fact the artifacts (jars) will differ because they will come from different revisions of Gora sources. Therefore Nutch builds based on the same svn rev. won't be repeatable across different environments.

As much as it pains the ivy purists ;) until Gora publishes versioned artifacts I'd like to revert the fix in NUTCH-873 and add again Gora jars built from a known external rev. We can add a README that contains commit id from Gora.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[Nutch Wiki] Trivial Update of JavaDemoApplication by Cristian Vulpe
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification.

The JavaDemoApplication page has been changed by Cristian Vulpe.
http://wiki.apache.org/nutch/JavaDemoApplication?action=diff&rev1=14&rev2=15

--
  With that, all is ready and we can now write some simple code to search. A quick example in Java to search the crawl index and return the number of hits found is:
  {{{
- package com.siemens.scr.sgcm.service;
-
- import java.util.Date;
  // necessary imports
  import org.apache.hadoop.conf.Configuration;
@@ -124, +121 @@
  import org.apache.nutch.searcher.NutchBean;
  import org.apache.nutch.searcher.Query;
  import org.apache.nutch.util.NutchConfiguration;
+ import java.util.Date;

  public class Search {
  public static void main(String[] args) {
[Nutch Wiki] Update of JavaDemoApplication by Cristian Vulpe
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification.

The JavaDemoApplication page has been changed by Cristian Vulpe.
http://wiki.apache.org/nutch/JavaDemoApplication?action=diff&rev1=15&rev2=16

--
  <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
  <configuration>
- <property>
+ <property>
- <name>plugin.folders</name>
+ <name>plugin.folders</name>
- <value>${nutch.site.plugin.folders}
+ <value>${nutch.site.plugin.folders}
- </value>
- <description />
- </property>
+ </value>
+ <description />
+ </property>
- <property>
- <name>searcher.dir</name>
+ <property>
+ <name>searcher.dir</name>
- <value>${nutch.site.searcher.dir}</value>
+ <value>${nutch.site.searcher.dir}</value>
- <description />
- </property>
+ <description />
+ </property>
  </configuration>
  }}}
+ and run the java application using the appropriate parameters:
- and run the java application using the appropriate parameters:
  {{{
  -Dnutch.site.plugin.folders=c:\tools\crawlers\apache-nutch-1.1-bin\plugins -Dnutch.site.searcher.dir=c:\tools\crawlers\apache-nutch-1.1-bin\crawl
  }}}
-
  === CLASSPATH Configuration ===
-
  You also need to make sure that the following jars are placed in WEB-INF/lib (this assumes usage of Nutch 0.9):
  {{{
@@ -70, +68 @@
  lucene-misc-2.2.0.jar
  nutch-0.9.jar
  }}}
-
  For a standalone application, one might want to use Apache maven (this configuration assumes Nutch 1.1). At the moment of writing this note, Nutch does not publish its artifacts to maven. However we (members of community) hope that maven support will be added soon. In the meantime, just install the nutch-1.1.jar to your maven repository.
  Here is a snippet that will manage the dependencies that you need to run this example (note that the 1.1-XXX version of Nutch marks the fact that the artifact cannot be found in any public repository yet):
  {{{
  <dependency>
- <groupId>org.apache.nutch</groupId>
+ <groupId>org.apache.nutch</groupId>
- <artifactId>nutch</artifactId>
+ <artifactId>nutch</artifactId>
- <version>1.1-XXX</version>
+ <version>1.1-XXX</version>
  </dependency>
  <dependency>
- <groupId>org.apache.hadoop</groupId>
+ <groupId>org.apache.hadoop</groupId>
- <artifactId>hadoop-core</artifactId>
+ <artifactId>hadoop-core</artifactId>
- <version>0.20.2</version>
+ <version>0.20.2</version>
  </dependency>
  <dependency>
- <groupId>org.apache.lucene</groupId>
+ <groupId>org.apache.lucene</groupId>
- <artifactId>lucene-core</artifactId>
+ <artifactId>lucene-core</artifactId>
- <version>3.0.1</version>
+ <version>3.0.1</version>
- <scope>runtime</scope>
+ <scope>runtime</scope>
  </dependency>
  <dependency>
- <groupId>org.apache.lucene</groupId>
+ <groupId>org.apache.lucene</groupId>
- <artifactId>lucene-misc</artifactId>
+ <artifactId>lucene-misc</artifactId>
- <version>3.0.1</version>
+ <version>3.0.1</version>
- <scope>runtime</scope>
+ <scope>runtime</scope>
  </dependency>
  <dependency>
- <groupId>commons-lang</groupId>
+ <groupId>commons-lang</groupId>
- <artifactId>commons-lang</artifactId>
+ <artifactId>commons-lang</artifactId>
- <version>2.1</version>
+ <version>2.1</version>
- <scope>runtime</scope>
+ <scope>runtime</scope>
  </dependency>
  }}}
-
  == Sample code ==
  With that, all is ready and we can now write some simple code to search.
  A quick example in Java to search the crawl index and return the number of hits found is:
  {{{
-
  // necessary imports
  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.searcher.Hit;
@@ -124, +119 @@
  import java.util.Date;

  public class Search {
- public static void main(String[] args) {
+ public static void main(String[] args) {
- try {
- // define a keyword for the search
- String nutchSearchString = "smart";
+ try {
+ // define a keyword for the search
+ String nutchSearchString = "smart";
- // configure nutch
+ // configure nutch
- Configuration nutchConf = NutchConfiguration.create();
+ Configuration nutchConf = NutchConfiguration.create();
- NutchBean nutchBean = new NutchBean(nutchConf);
+ NutchBean nutchBean = new NutchBean(nutchConf);
- // build the query
+ // build the query
- Query nutchQuery = Query.parse(nutchSearchString, nutchConf);
+ Query
[Nutch Wiki] Trivial Update of JavaDemoApplication by Cristian Vulpe
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification.

The JavaDemoApplication page has been changed by Cristian Vulpe.
http://wiki.apache.org/nutch/JavaDemoApplication?action=diff&rev1=16&rev2=17

--
  }
  }
  }}}
-
  Extra information about developing a standalone application that does the search can be obtained by inspecting the main method in org.apache.nutch.searcher.NutchBean.

  === Authors ===
+ Chaz Hickman (Jan 2008)
- Chaz Hickman (Jan 2008)

  Cristi Vulpe (Aug 2010)
[jira] Commented: (NUTCH-891) Nutch build should not depend on unversioned local deps
[ https://issues.apache.org/jira/browse/NUTCH-891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900283#action_12900283 ]

Chris A. Mattmann commented on NUTCH-891:
-

Hi Andrzej:

Can I get some clarification on this? First, local Ivy jars are versioned, by artifact id and by version #. So, we're talking about gora-0.1.jar here, right? So, your point is, if I'm off gitting and developing on Gora, at any point in time, I can run ant on gora and then it updates my local Ivy repo with a gora-0.1.jar file, right? And your point is, this file is different than the previous gora-0.1.jar file (-N minutes ago), and so thus, Nutch isn't really depending on a stable version, right?

If the above is true, what it suggests to me is that perhaps the process of installing Gora as a local Ivy dependency (independent of Nutch's deps) needs a bit more discipline. I'd say, why not make the Gora Ant build publish a gora-0.1-<some snapshot id aka SVN rev or UUID or whatever>.jar? In that fashion, you could develop on Gora without fear of changing anything in the way that Nutch depends on it (because the 0.1 version that Nutch depends on could be frozen as is).

I'd also be a fan, if the above isn't true or doesn't make sense, of actually just uploading Gora to Maven central -- can we try that?

Cheers,
Chris

P.S. I'm not trying to be difficult about NUTCH-873 b/c I was the one who did it. If in the end the consensus is to revert it, no egos here, go ahead. I'm just trying to figure out a solution to the problem that allows us to use Ivy as it should be and to not have to make exceptions. My other thought along these lines is that if we can't wrap our heads around Ivy, or getting to Maven Central in any short amount of time, then what about pulling Gora into Nutch SVN? It's ASLv2 licensed and there is nothing against doing this. From there, there would be a clean path to move to Incubation since the code would already be in Apache SVN anyways...
Re: Does ivy fetch all dependancies for trunk?
Hi Alex,

Unfortunately because Gora isn't published anywhere, Ivy won't automatically fetch Gora for you. But, it will automatically include it in your build, if it's in your local Ivy repository. Steps to make that happen:

1. git clone git://github.com/enis/gora.git
2. cd gora
3. ant

Then, after doing that, Nutch should build fine...

Cheers,
Chris

On 8/19/10 12:56 AM, Alex McLintock alex.mclint...@gmail.com wrote:

> I'm trying to get Nutch 2.0 (dev) fetched out of trunk. Does it compile right now? Currently it complains it can't find org.gora#gora-core;0.1
>
> What worries me is that I don't know whether this is a misconfiguration on my part - or something else externally has gone wrong.
>
> Alex
>
> a...@reynolds:~/projects/nutch-2/trunk$ svn update
> At revision 987056.
> a...@reynolds:~/projects/nutch-2/trunk$ ant
> Buildfile: build.xml
>
> ivy-probe-antlib:
> ivy-download:
> -ivy-download-unchecked:
> ivy-init-antlib:
> ivy-init:
> init:
> clean-lib:
> resolve-default:
> [ivy:resolve] :: Ivy 2.1.0 - 20090925235825 :: http://ant.apache.org/ivy/ ::
> [ivy:resolve] :: loading settings :: file = /home/alex/projects/nutch-2/trunk/ivy/ivysettings.xml
> [ivy:resolve]
> [ivy:resolve] :: problems summary ::
> [ivy:resolve] WARNINGS
> [ivy:resolve] module not found: org.gora#gora-core;0.1
> [ivy:resolve] local: tried
> [ivy:resolve] /home/alex/.ivy2/local/org.gora/gora-core/0.1/ivys/ivy.xml
> [ivy:resolve] -- artifact org.gora#gora-core;0.1!gora-core.jar:
> [ivy:resolve] /home/alex/.ivy2/local/org.gora/gora-core/0.1/jars/gora-core.jar
> [ivy:resolve] ::
> [ivy:resolve] :: UNRESOLVED DEPENDENCIES ::
> [ivy:resolve] ::
> [ivy:resolve] :: org.gora#gora-core;0.1: not found
> [ivy:resolve] ::
> [ivy:resolve]
> [ivy:resolve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
>
> BUILD FAILED
> /home/alex/projects/nutch-2/trunk/build.xml:319: impossible to resolve dependencies: resolve failed - see output for details
>
> Total time: 2 seconds

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++
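
The resolve log in this message lists the two files Ivy looked for in the local repository. A minimal Java sketch that rebuilds those two paths and reports whether they exist might look like the following. The class and method names are hypothetical, and the directory layout is copied from the log output only, so it reflects the settings seen in that build rather than every Ivy configuration.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: mirrors the two locations listed in the [ivy:resolve] output.
final class LocalIvyProbe {

    // <ivyHome>/local/<org>/<module>/<rev>/ivys/ivy.xml
    static Path descriptorPath(Path ivyHome, String org, String module, String rev) {
        return ivyHome.resolve("local").resolve(org).resolve(module).resolve(rev)
                .resolve("ivys").resolve("ivy.xml");
    }

    // <ivyHome>/local/<org>/<module>/<rev>/jars/<module>.jar
    static Path jarPath(Path ivyHome, String org, String module, String rev) {
        return ivyHome.resolve("local").resolve(org).resolve(module).resolve(rev)
                .resolve("jars").resolve(module + ".jar");
    }

    public static void main(String[] args) {
        // Assumption: the local repository lives under ~/.ivy2, as in the log.
        Path ivyHome = Paths.get(System.getProperty("user.home"), ".ivy2");
        Path ivyXml = descriptorPath(ivyHome, "org.gora", "gora-core", "0.1");
        Path jar = jarPath(ivyHome, "org.gora", "gora-core", "0.1");
        System.out.println(ivyXml + " exists: " + Files.exists(ivyXml));
        System.out.println(jar + " exists: " + Files.exists(jar));
    }
}
```

If both files are reported missing, the three installation steps listed in the reply (clone, cd, ant) have presumably not been run yet on that machine.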
Re: Does ivy fetch all dependancies for trunk?
+1...or better yet:

1. create a JIRA to include the info from NUTCH-873 inside of README.txt.
2. create a patch for README.txt to include this information
3. attach patch to newly created JIRA

Thanks!

Cheers,
Chris

On 8/19/10 2:40 AM, Alex McLintock alex.mclint...@gmail.com wrote:

> On 19 August 2010 09:39, Evgeniy Serykh s...@openteam.ru wrote:
>> Try to look at this issue https://issues.apache.org/jira/browse/NUTCH-873
>> Cheers
>
> Perhaps I ought to write a README.txt :-)
>
> Alex
Re: Does ivy fetch all dependancies for trunk?
Ahh, I hadn't seen this before I commented on NUTCH-891. At the same time, I think my comments over there still stand. This isn't a problem with NUTCH-873, it's a problem with the way that Gora is built. I don't think checking in the jars and defeating the purpose of using {Ivy/Maven2/pick your favorite dep mgmt system} is the solution to this - the solution IMHO is disciplined management of a local Ivy/Maven2 repository and installation of Gora to it...

Cheers,
Chris

On 8/19/10 3:47 AM, Andrzej Bialecki a...@getopt.org wrote:

> On 2010-08-19 11:40, Alex McLintock wrote:
>> On 19 August 2010 09:39, Evgeniy Serykh s...@openteam.ru wrote:
>>> Try to look at this issue https://issues.apache.org/jira/browse/NUTCH-873
>>> Cheers
>>
>> Perhaps I ought to write a README.txt :-)
>
> Hold on - the fix committed in that issue should be reverted IMHO. I'll open a new issue for this.
>
> (Background: Nutch build process cannot rely on unversioned local artifacts, because then the builds are not repeatable across environments - you execute 'git pull' today and install the jars into your ivy2/local, someone commits new code to Gora, then I execute 'git pull' and kaboom - things are different, there may be different errors or bugs showing up, because our local versions of gora-core-0.1 are in fact different).
>
> --
> Best regards,
> Andrzej Bialecki
> Information Retrieval, Semantic Web
> Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com
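
One way to see whether two locally installed jars that carry the same name are really the same file is to compare checksums. The sketch below is only an illustration of that idea, not part of the Nutch or Gora build; the class name is hypothetical. Note the limitation: identical digests show the files are byte-for-byte equal, while differing digests may also come from build timestamps inside the jar rather than from different sources.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: prints a SHA-256 fingerprint for each file given on the command line.
final class JarFingerprint {

    // Returns the SHA-256 digest of the given bytes as lowercase hex.
    static String sha256Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(content);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required of every Java platform implementation.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage sketch: java JarFingerprint path/to/gora-core.jar
        for (String arg : args) {
            byte[] bytes = Files.readAllBytes(Paths.get(arg));
            System.out.println(sha256Hex(bytes) + "  " + arg);
        }
    }
}
```

Two developers could run this against their own copies of the jar under ivy2/local and compare the printed values.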
[jira] Commented: (NUTCH-891) Nutch build should not depend on unversioned local deps
[ https://issues.apache.org/jira/browse/NUTCH-891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900290#action_12900290 ]

Chris A. Mattmann commented on NUTCH-891:
-

bq. Sure, that would solve the problem for now - I'll bother the Gora devs, and you can create the patch, ok? Ultimately we should go with the other solution (publish to Maven), but it requires more involvement from Gora devs.

I like it! lol. Sure, I'll try and create a patch to make it do that. Installing Gora forced me to figure out how to use git the other day, so why not figure out how to patch Gora! ^_^

bq. Neither am I, no egos here - I just find the current situation after the fix to be intractable, especially when doing bugfixing and testing - because even if APIs stay the same, hidden bugs may not be the same across revisions...

I hear ya. OK let me think on this -- we definitely need a solution here. In the meanwhile I'll try and figure out how to patch Gora ant to make it version the jar on the Ivy install in a more meaningful way.
[jira] Commented: (NUTCH-891) Nutch build should not depend on unversioned local deps
[ https://issues.apache.org/jira/browse/NUTCH-891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900335#action_12900335 ]

Enis Soztutar commented on NUTCH-891:
-

Of course, the best way for Nutch to use Gora is for Gora to publish its artifacts to Maven and for Nutch to use ivy to fetch the jars. But Gora is still in heavy development and we need some more time to make a first release. Until then I think we can use the last commit sha1 in git for the revision number. We use this convention when uploading jars to github. Would that make sense?
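
The suggestion above, using the last commit sha1 as the revision, can be sketched as a small naming helper. Everything here is an assumption made for illustration: the thread does not say how the sha1 is combined with the 0.1 base version, whether it is abbreviated, or what the resulting jar is called. The sketch appends the first seven hex characters to the base version.

```java
// Hypothetical naming helper; the exact convention used by Gora is not given in the thread.
final class GoraRevision {

    // Builds a revision string from a base version and a full 40-character commit id.
    static String revision(String baseVersion, String commitSha1) {
        if (!commitSha1.matches("[0-9a-f]{40}")) {
            throw new IllegalArgumentException("not a full SHA-1: " + commitSha1);
        }
        // Assumption: a 7-character abbreviation is enough to tell builds apart.
        return baseVersion + "-" + commitSha1.substring(0, 7);
    }

    // Builds the jar file name for a module at that revision.
    static String jarName(String module, String baseVersion, String commitSha1) {
        return module + "-" + revision(baseVersion, commitSha1) + ".jar";
    }

    public static void main(String[] args) {
        // Placeholder commit id, not a real Gora commit.
        String sha1 = "0123456789abcdef0123456789abcdef01234567";
        System.out.println(jarName("gora-core", "0.1", sha1));
    }
}
```

With a scheme like this, two installs made from different commits would no longer share the same revision string, which is the repeatability concern raised in the issue.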
[jira] Created: (NUTCH-892) nutch maven build support
nutch maven build support
-

Key: NUTCH-892
URL: https://issues.apache.org/jira/browse/NUTCH-892
Project: Nutch
Issue Type: Improvement
Components: build
Affects Versions: 1.1
Reporter: Marius Cristian Vulpe
Priority: Minor

I use the nutch search mechanism from a standalone java application. I use maven to configure my dependencies and I have seen that nutch doesn't publish any artifacts to the public repositories. Please let me know if somebody is working in this direction. If not, I think I can spend some time to mavenize the project and I can send you a version of that (I plan to do that for version 1.1). I would need feedback on this.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
To nutch or not to nutch?
Hi there!

I'm building a crawler that will understand some kinds of pages. I want to be able to process a restricted group of websites. In essence, for example: I want to search for reviews of my company's products in some blogs I know well.

I don't know if Nutch can help me here. What I'm currently doing is a crawler that fetches pages, transforms them with xslt using the template designed for the site, and then parses the content.

The question here is: Can this be done well with Nutch or will it imply a big overhead? What plugins will need to be developed?

Thank you!
[jira] Commented: (NUTCH-892) nutch maven build support
[ https://issues.apache.org/jira/browse/NUTCH-892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900369#action_12900369 ]

Julien Nioche commented on NUTCH-892:
-

See https://issues.apache.org/jira/browse/NUTCH-825 for a discussion on publishing Nutch artifacts, and https://issues.apache.org/jira/browse/NUTCH-821

There was some form of consensus that the publication would be done manually (we don't release very often). We have decided to use IVY starting with Nutch 2.0, so introducing Maven on top of it is definitely -1.
Re: To nutch or not to nutch?
Hello Gonzalo,

Did you mean to post to the dev list? Further comments inline.

On 19 August 2010 18:25, Gonzalo Aguilar Delgado gagui...@aguilardelgado.com wrote:

> Hi there! I'm building a crawler that will understand some kind of pages. I want to be able to process a restricted group of websites.

Nutch lets you configure a URL filter that restricts the crawl to URLs matching a specific set of regular expressions, which can be used to limit it to certain hosts.

> In essence, for example: I want to search for reviews of the products of my company in some blogs I well know.

That sounds like a standard data mining requirement.

> I don't know if Nutch can help me here.

Well, it can, but not out of the box. It depends on what sort of automation you want. Nutch can crawl all those sites and build up a Solr/Lucene index for you to search through, but I am guessing that won't help you very much.

> What I'm currently doing is a crawler that fetches pages, transforms them with the template designed for the site with xslt

Eh? You are using xslt to transform random web pages? Doesn't the xslt fall over whenever it finds xml that is not well formed?

> and the parses content.

Parses it for what? What do you do with it?

> The question here is: Can this be done well with Nutch or will it imply a big overhead?

I don't think this is *easy* with Nutch. The overhead may be worth it if you want to do the web crawling on a small cluster rather than one machine. There may be other better data mining tools, but I'm not sure I can recommend anything right now.

> What plugins will needs to be developed?

Well, that depends on what you want. Presumably you want something that identifies the web page as a review of your product so that it can be highlighted in the index. How do you want to do that?

> Thank you!

I've been thinking about this for some time - but to search for book reviews instead of product reviews. I can't say that I have a working system, but maybe others do.

Alex
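
The URL filter idea mentioned in this reply can be illustrated with a small standalone sketch. This is not Nutch's filter class; it is a hypothetical imitation that assumes rule lines start with `+` (accept) or `-` (reject) followed by a regular expression, that the first matching rule decides, and that a URL matching no rule is rejected. The host `blog.example.com` is a placeholder.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of a first-match-wins regex URL filter.
final class HostFilterSketch {

    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    // Adds a rule line such as "+^https?://blog\.example\.com/" or "-\.(gif|jpg)$".
    HostFilterSketch add(String line) {
        char sign = line.charAt(0);
        if (sign != '+' && sign != '-') {
            throw new IllegalArgumentException("rule must start with + or -: " + line);
        }
        rules.add(new Rule(sign == '+', Pattern.compile(line.substring(1))));
        return this;
    }

    // The first rule whose pattern is found in the URL decides; no match means reject.
    boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        HostFilterSketch filter = new HostFilterSketch()
                .add("-\\.(gif|jpg|png)$")
                .add("+^https?://([a-z0-9-]+\\.)*blog\\.example\\.com/");
        System.out.println(filter.accepts("http://blog.example.com/reviews/1"));
        System.out.println(filter.accepts("http://other.example.org/"));
    }
}
```

Restricting a crawl to a handful of known blogs would then come down to one accept rule per host, with everything else rejected by default.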
[jira] Closed: (NUTCH-892) nutch maven build support
[ https://issues.apache.org/jira/browse/NUTCH-892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marius Cristian Vulpe closed NUTCH-892.
---

Resolution: Won't Fix

Julien, thanks for the quick response! I am looking forward to having the artifacts published to maven (Ivy is a good solution as well). I will close this one.
[jira] Commented: (NUTCH-891) Nutch build should not depend on unversioned local deps
[ https://issues.apache.org/jira/browse/NUTCH-891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900455#action_12900455 ]

Andrzej Bialecki commented on NUTCH-891:
-

Yes, this would help.