Lewis,

https://issues.apache.org/jira/browse/NUTCH-1068

That is the issue I filed about the patch (it isn't directly related to this, but it is related to some potential fixes).

http://www.mail-archive.com/dev%40nutch.apache.org/msg03419.html

That's the e-mail thread where I originally mentioned the modifications to automaton, and the patch with the backport of the Lucene fixes.

Kirby

On Fri, Nov 11, 2011 at 11:58 AM, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote:
> Excellent Kirby, thanks for this.
>
> The obvious question I guess... where does this leave us with regards to the
> urlfilter-automation libraries?
>
> For the record as well, can you please provide the Jira you filed, it would
> be good to know where I can begin with this one.
>
> Thanks
>
> On Thu, Nov 10, 2011 at 10:18 PM, Kirby Bohling <kirby.bohl...@gmail.com>
> wrote:
>>
>> On Thu, Nov 10, 2011 at 6:14 PM, Lewis John Mcgibbney
>> <lewis.mcgibb...@gmail.com> wrote:
>> > OK so the required dependencies can be seen below
>> >
>> > - FeedParser <dependency org="net.java.dev.rome" name="rome" rev="1.0.0" conf="*->master"/>
>> > - URLAutomationFilter - <dependency org="dk.brics" name="automaton" rev="???"/>
>> > - SWFParser <dependency org="com.google.gwt" name="gwt-incubator" rev="2.0.1"/>
>> > - HTMLParser <dependency org="net.sourceforge.nekohtml" name="nekohtml" rev="1.9.15"/>
>> >
>> > A real nasty hack, which would replace the usual ${nutch.root} with
>> > <include file="../../../ivy/ivy-configurations.xml"/>, is possible;
>> > however, this is not how I want to progress.
>> >
>> > I'm also not sure where to find the dk.brics dependency.
>>
>> The Automaton library, to the best of my knowledge, is not available via
>> Maven's central repo.
>>
>> http://www.brics.dk/automaton/ is the site where you can find it.
>>
>> That's the location of the actual jar.
>> http://www.brics.dk/automaton/automaton.jar
>>
>> In order to get the source you have to submit an e-mail address, but
>> it is all available under the newer BSD/MIT license.
>>
>> I believe all of the functionality actually used by Nutch is in a
>> faster form buried inside the Lucene Util library 4.0 (unreleased last
>> I knew). I believe I filed a JIRA issue about my backport of the
>> Lucene improvements to the library at Julian's request. I have
>> submitted the code to the author, but I'm not sure if he has
>> integrated it. He was short on time when I submitted all of it.
>>
>> It is a nice library, but it isn't very 3rd party user friendly (no
>> bug tracker, no public source repo).
>>
>> Kirby
>>
>> >
>> > Any thoughts? Jira issue?
>> >
>> > Thanks
>> >
>> > On Thu, Nov 10, 2011 at 12:39 AM, Andrzej Bialecki <a...@getopt.org>
>> > wrote:
>> >>
>> >> On 10/11/2011 04:39, Lewis John Mcgibbney wrote:
>> >>>
>> >>> Gets even more strange, both SWFParser and AutomationURLFilter import
>> >>> additional dependencies, however they are not included within their
>> >>> plugin/ivy/ivy.xml files!
>> >>>
>> >>> Am I missing something here?
>> >>
>> >> Most likely these problems come from the initial porting of a pure ant
>> >> build to an ant+ivy build. We should determine what deps are really
>> >> needed by these plugins, and sanitize the ivy.xml files so that they
>> >> make sense - if the existing files can't be untangled we can ditch
>> >> them and come up with new, clean ones.
>> >>
>> >> --
>> >> Best regards,
>> >> Andrzej Bialecki <><
>> >> ___. ___ ___ ___ _ _ __________________________________
>> >> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
>> >> ___|||__|| \| || | Embedded Unix, System Integration
>> >> http://www.sigram.com Contact: info at sigram dot com
>> >>
>> >
>> > --
>> > Lewis
>> >
>
> --
> Lewis
>