Re: [Nutch-dev] Indexing the Interesting Part Only...

2007-03-11 Thread Michael Wechner
-- Michael Wechner Wyona - Open Source Content Management -Apache Lenya http://www.wyona.com

Re: [Nutch-dev] Indexing the Interesting Part Only...

2007-03-11 Thread Michael Wechner
Michael Wechner wrote: d e wrote: I'm sorry! I guess I was REALLY not clear. I mean my problem is to drop the junk *on each page*. I am indexing news sites. I want to harvest news STORIES, not the advertisements and other junk text around the outside of each page. Got suggestions

[Nutch-dev] Getting a semantic version of an HTML page

2007-02-06 Thread Michael Wechner
than the HTML itself <?xml version="1.0"?> <semantic-of href="index.html"> ... </semantic-of> resp. some RDF or whatever. Any pointers are very welcome. Thanks Michi -- Michael Wechner Wyona - Open Source Content Management -Apache Lenya http://www.wyona.com http

Re: [Nutch-dev] SynonymEditor

2007-01-17 Thread Michael Wechner
Krebs, Urs wrote: Hi list, I made a tool that can modify and create .owl files as synonym lists. I don't know where to put it. May I add it to JIRA? I think JIRA would be good for a start Cheers Michael or could you use it directly in nutch? Urs -- Michael Wechner Wyona - Open

[Nutch-dev] [jira] Created: (NUTCH-418) Fixes parsing of XHTML (e.g. title)

2006-12-21 Thread Michael Wechner (JIRA)
Reporter: Michael Wechner Fixes parsing of XHTML (e.g. title) -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http

[Nutch-dev] [jira] Updated: (NUTCH-418) Fixes parsing of XHTML (e.g. title)

2006-12-21 Thread Michael Wechner (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-418?page=all ] Michael Wechner updated NUTCH-418: -- Attachment: parse-xhtml-patch.txt patch which fixes the mime-type Fixes parsing of XHTML (e.g. title

Re: [Nutch-dev] Extracting title from XHTML pages

2006-12-21 Thread Michael Wechner
Michael Wechner wrote: Sami Siren wrote: Michael Wechner wrote: Hi It seems to me that Nutch 0.8.x cannot extract the title from an XHTML page, e.g. Try changing the following in your parse-plugins.xml <mimeType name="application/xhtml+xml"> <plugin id="parse-html"

Re: [Nutch-dev] Extracting title from XHTML pages

2006-12-21 Thread Michael Wechner
Michael Wechner wrote: I have added a patch https://issues.apache.org/jira/secure/ManageAttachments.jspa?id=12359202 sorry, I actually meant https://issues.apache.org/jira/browse/NUTCH-418 Cheers Michi Thanks Michi Cheers Michi -- Sami Siren -- Michael Wechner

[Nutch-dev] Extracting title from XHTML pages

2006-12-20 Thread Michael Wechner
: org.apache.nutch.parse.text.TextParser mapped to contentType application/xhtml+xml via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: application/xhtml+xml Can anyone confirm this resp. shall I add a bug entry? Thanks Michi -- Michael Wechner Wyona - Open Source Content Management

Re: [Nutch-dev] Extracting title from XHTML pages

2006-12-20 Thread Michael Wechner
Sami Siren wrote: Michael Wechner wrote: Hi It seems to me that Nutch 0.8.x cannot extract the title from an XHTML page, e.g. Try changing the following in your parse-plugins.xml <mimeType name="application/xhtml+xml"> <plugin id="parse-html" /> </mimeType>

[Nutch-dev] difference between intranet and internet crawling

2006-12-20 Thread Michael Wechner
, but what do you mean by intranet and internet crawling? In the end both of them are just URLs ... right? It seems to me I completely misunderstand something. Thanks for a hint Michi -- Michael Wechner Wyona - Open Source Content Management -Apache Lenya http://www.wyona.com

Re: [Nutch-dev] File system watching for intranets

2006-09-13 Thread Michael Wechner
questions/whatever on this subject is appreciated as I would like to come up with a more optimal solution for us intranet nutch users. Ben -- Michael Wechner Wyona - Open Source Content Management -Apache Lenya http://www.wyona.com http://lenya.apache.org [EMAIL

Re: [Nutch-dev] Ontology compile bug

2006-09-08 Thread Michael Wechner
just commented, hence the minor difference of the two slashes ;-) HTH Michi Otis - Original Message From: Michael Wechner [EMAIL PROTECTED] To: nutch-dev@lucene.apache.org Sent: Tuesday, August 22, 2006 9:07:12 AM Subject: Ontology compile bug Hi It seems to me that refine-query

Re: [Nutch-dev] Checking if crawl dir exists ...

2006-08-26 Thread Michael Wechner
Hasan Diwan wrote: On 25/08/06, Michael Wechner [EMAIL PROTECTED] wrote: Index: nutch-0.8/src/java/org/apache/nutch/searcher/NutchBean.java === --- nutch-0.8/src/java/org/apache/nutch/searcher/NutchBean.java (Revision

[Nutch-dev] Checking if crawl dir exists ...

2006-08-25 Thread Michael Wechner
(dir.toString()).exists()) { + LOG.warn("No such directory: " + new java.io.File(dir.toString())); + } Path servers = new Path(dir, "search-servers.txt"); if (fs.exists(servers)) { if (LOG.isInfoEnabled()) { WDYT? Thanks Michi -- Michael Wechner Wyona
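The idea behind the patch fragment above can be sketched on its own. This is only a minimal illustration, not the actual NutchBean change: the class name `CrawlDirCheck`, the method `crawlDirExists`, and the use of `java.util.logging` are assumptions made for the example, and it checks the local file system with `java.io.File` rather than going through Nutch's own file system abstraction.

```java
import java.io.File;
import java.util.logging.Logger;

// Hypothetical stand-alone sketch; names are not taken from the Nutch code base.
public class CrawlDirCheck {
    private static final Logger LOG = Logger.getLogger(CrawlDirCheck.class.getName());

    /** Returns true if dir is an existing directory; logs a warning otherwise. */
    public static boolean crawlDirExists(String dir) {
        File crawlDir = new File(dir);
        if (!crawlDir.isDirectory()) {
            // Say plainly what is wrong instead of failing later with an obscure error.
            LOG.warning("No such directory: " + crawlDir.getAbsolutePath());
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // "crawl" is only a placeholder default for the example.
        String dir = args.length > 0 ? args[0] : "crawl";
        System.out.println(dir + " exists: " + crawlDirExists(dir));
    }
}
```

Logging the absolute path is a deliberate choice here: a relative directory setting resolves against whatever directory the process was started from, so the warning shows where the lookup actually happened.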

Re: [Nutch-dev] Checking if crawl dir exists ...

2006-08-25 Thread Michael Wechner
a misconfigured searcher.dir either. So, it can be very confusing, especially for beginners, because one starts scratching and looking what might be the problem and actually the problem is quite simple. Enough motivation ;-) ? HTH Michi Stefan Am 25.08.2006 um 06:52 schrieb Michael Wechner: Hi I think

Re: [Nutch-dev] HTTP Accept Header seems to be missing

2006-08-17 Thread Michael Wechner
Sami Siren wrote: Michael Wechner wrote: Hi It seems to me that Nutch does not send a HTTP Accept Header. Is that on purpose? I would have expected that Nutch tells the server which mime-types it accepts resp. is able to parse and index, but maybe I misunderstand something

[Nutch-dev] HTTP Accept Header seems to be missing

2006-08-16 Thread Michael Wechner
Hi It seems to me that Nutch does not send a HTTP Accept Header. Is that on purpose? I would have expected that Nutch tells the server which mime-types it accepts resp. is able to parse and index, but maybe I misunderstand something. Thanks Michi -- Michael Wechner Wyona - Open
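To make the question concrete, here is a rough sketch of how an Accept header value could be assembled from the mime-types a crawler is able to parse. It is not taken from Nutch: the class `AcceptHeaderSketch`, the method `buildAcceptHeader`, and the low-priority `*/*;q=0.1` fallback are all assumptions chosen for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; it does not reflect how Nutch builds its requests.
public class AcceptHeaderSketch {

    /** Joins the parseable mime-types and appends a low-priority catch-all. */
    public static String buildAcceptHeader(List<String> parseableTypes) {
        StringBuilder sb = new StringBuilder();
        for (String type : parseableTypes) {
            if (sb.length() > 0) {
                sb.append(", ");
            }
            sb.append(type);
        }
        if (sb.length() > 0) {
            sb.append(", ");
        }
        // Assumption: still accept anything else, but with a low preference.
        sb.append("*/*;q=0.1");
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> types = Arrays.asList("text/html", "application/xhtml+xml");
        // Prints: Accept: text/html, application/xhtml+xml, */*;q=0.1
        System.out.println("Accept: " + buildAcceptHeader(types));
    }
}
```

With the standard library, such a value could be attached to a request via `HttpURLConnection.setRequestProperty("Accept", value)`. Whether a server actually honours it is a separate matter.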

[Nutch-dev] Allowing search from command line

2006-08-14 Thread Michael Wechner
to test the crawl more quickly than first having to setup the WAR file. WDYT? Thanks Michi -- Michael Wechner Wyona - Open Source Content Management -Apache Lenya http://www.wyona.com http://lenya.apache.org [EMAIL PROTECTED][EMAIL

Re: [Nutch-dev] Library for extracting text content from binaries

2006-07-24 Thread Michael Wechner
-- Yukatan - http://yukatan.fi/ - [EMAIL PROTECTED] Software craftsmanship, JCR consulting, and Java development -- Michael Wechner Wyona - Open Source Content Management -Apache Lenya http://www.wyona.com http://lenya.apache.org [EMAIL PROTECTED

Re: [Nutch-dev] IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?

2006-06-15 Thread Michael Wechner
starts crawling the site. Michi -- Michael Wechner Wyona - Open Source Content Management -Apache Lenya http://www.wyona.com http://lenya.apache.org [EMAIL PROTECTED][EMAIL PROTECTED] +41 44 272 91 61

[Nutch-dev] Re: hyperbolic browser api (I missed)

2005-09-22 Thread Michael Wechner
... -- Michael Wechner Wyona - Open Source Content Management -Apache Lenya http://www.wyona.com http://lenya.apache.org [EMAIL PROTECTED][EMAIL PROTECTED]

[Nutch-dev] Re: hyperbolic browser api (I missed)

2005-09-22 Thread Michael Wechner
please apologize for sending this private message to the mailing list. Thanks Michi Michael Wechner wrote: Hey Gavin It's quite some time since we met in San Francisco. How are you? Hope all is well. All the best Michael Gavin Thomas Nicol wrote: On Sep 21, 2005, at 11:55 AM, Jack

Re: [Nutch-dev] Re: [EMAIL PROTECTED] Mailinglist

2005-04-22 Thread Michael Wechner
Doug Cutting wrote: Should I send a final notice asking folks to join the Apache list, and then shut down the sourceforge list? Well, I think people will move quickly to the Apache ML if the sourceforge one is being shut down ;-) Michi -- Michael Wechner Wyona Inc. - Open Source Content

[Nutch-dev] Re: Starting the webapp and finding the segments

2005-04-20 Thread Michael Wechner
Doug Cutting wrote: Michael Wechner wrote: one needs to start the servlet container within the directory where the segments directory is located. Because I often forget this I receive a NullPointerException and then after some time I remember that I started Tomcat from the wrong directory. I

[Nutch-dev] [EMAIL PROTECTED] Mailinglist

2005-04-20 Thread Michael Wechner
a reason that [EMAIL PROTECTED] still exists? Thanks Michi -- Michael Wechner Wyona Inc. - Open Source Content Management - Apache Lenya http://www.wyona.com http://lenya.apache.org [EMAIL PROTECTED][EMAIL PROTECTED

Re: [Nutch-dev] [jira] Commented: (NUTCH-42) enhance search.jsp such that it can also returns XML

2005-04-16 Thread Michael Wechner
Doug Cutting wrote: Michael Wechner wrote: Doug Cutting commented on NUTCH-42: --- I prefer we have a servlet that generates only XML, and then generate HTML from this XML. Do you dislike that approach for some reason? no, not at all. I much more prefer

[Nutch-dev] Starting the webapp and finding the segments

2005-04-16 Thread Michael Wechner
started Tomcat from the wrong directory. I think it would make sense if the segments directory could be specified within the web.xml and also in case the segments directory cannot be found a nice Exception would be thrown telling one what might be wrong. WDYT? Thanks Michi -- Michael Wechner

[Nutch-dev] [jira] Updated: (NUTCH-42) enhance search.jsp such that it can also returns XML

2005-04-16 Thread Michael Wechner (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-42?page=history ] Michael Wechner updated NUTCH-42: - Attachment: search.jsp.diff Add RSS link to search.jsp pointing to the OpenSearch servlet enhance search.jsp such that it can also returns XML

Re: [Nutch-dev] [jira] Commented: (NUTCH-42) enhance search.jsp such that it can also returns XML

2005-04-15 Thread Michael Wechner
search.jsp such that it can also returns XML Key: NUTCH-42 URL: http://issues.apache.org/jira/browse/NUTCH-42 Project: Nutch Type: Wish Components: web gui Reporter: Michael Wechner Priority: Trivial Attachments

[Nutch-dev] [jira] Created: (NUTCH-41) Replace CVS by SVN within tutorial of Documentation

2005-04-14 Thread Michael Wechner (JIRA)
Replace CVS by SVN within tutorial of Documentation --- Key: NUTCH-41 URL: http://issues.apache.org/jira/browse/NUTCH-41 Project: Nutch Type: Bug Reporter: Michael Wechner Priority: Trivial Attachments

[Nutch-dev] [jira] Updated: (NUTCH-41) Replace CVS by SVN within tutorial of Documentation

2005-04-14 Thread Michael Wechner (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-41?page=history ] Michael Wechner updated NUTCH-41: - Attachment: tutorial.xml.diff Replace CVS by SVN within tutorial of Documentation --- Key

[Nutch-dev] [jira] Created: (NUTCH-42) enhance search.jsp such that it can also returns XML

2005-04-14 Thread Michael Wechner (JIRA)
enhance search.jsp such that it can also returns XML Key: NUTCH-42 URL: http://issues.apache.org/jira/browse/NUTCH-42 Project: Nutch Type: Wish Components: web gui Reporter: Michael Wechner Priority

[Nutch-dev] Bot information within server log

2005-04-11 Thread Michael Wechner
to Apache. WDYT? Michi -- Michael Wechner Wyona Inc. - Open Source Content Management - Apache Lenya http://www.wyona.com http://lenya.apache.org [EMAIL PROTECTED][EMAIL PROTECTED]

[Nutch-dev] svn:ignore

2005-03-29 Thread Michael Wechner
-- Michael Wechner Wyona Inc. - Open Source Content Management - Apache Lenya http://www.wyona.com http://lenya.apache.org [EMAIL PROTECTED][EMAIL PROTECTED]

Re: [Nutch-dev] Re: svn:ignore

2005-03-29 Thread Michael Wechner
Doug Cutting wrote: as svn:ignore parameters Done. that was quick :-) Thanks Michi -- Michael Wechner Wyona Inc. - Open Source Content Management - Apache Lenya http://www.wyona.com http://lenya.apache.org [EMAIL PROTECTED][EMAIL PROTECTED

[Nutch-dev] Re: Starting a non-profit organisation running Nutch with a thousand or more sponsored servers

2005-03-17 Thread Michael Wechner
this would make crawling obsolete to a certain point (at least for pages being created by content management systems). Thanks Michi -- Michael Wechner Wyona Inc. - Open Source Content Management - Apache Lenya http://www.wyona.com http://lenya.apache.org [EMAIL PROTECTED

Re: [Nutch-dev] Starting a non-profit organisation running Nutch with a thousand or more sponsored servers

2005-03-17 Thread Michael Wechner
:[EMAIL PROTECTED] On Behalf Of Michael Wechner Sent: Wednesday, March 16, 2005 6:09 PM To: [EMAIL PROTECTED] Subject: [Nutch-dev] Starting a non-profit organisation running Nutch with a thousand or more sponsored servers Hi I was recently thinking that it would be fun to start a non-profit

Re: [Nutch-dev] Starting a non-profit organisation running Nutch with a thousand or more sponsored servers

2005-03-17 Thread Michael Wechner
, because it's a very central point re the web. Thanks Michi Otis --- Michael Wechner [EMAIL PROTECTED] wrote: Hi I was recently thinking that it would be fun to start a non-profit organization in order to run Nutch as a really transparent and open search engine, very similar as for instance

Re: [Nutch-dev] Re: Re: Starting a non-profit organisation ...

2005-03-17 Thread Michael Wechner
-- Michael Wechner Wyona Inc. - Open Source Content Management - Apache Lenya http

Re: [Nutch-dev] Re: Re: Starting a non-profit organisation ...

2005-03-17 Thread Michael Wechner
Michael Wechner wrote: could help here to make sure, that the organization would dissolve sorry, I meant would not because of money or other issues, but I guess that's another challenge ;-) Michi -- Michael Wechner Wyona Inc. - Open Source Content Management - Apache Lenya http

Notifying Nutch about content changes [WAS: Re: [Nutch-dev] Re: Starting a non-profit organisation running Nutch with a thousand or more sponsored servers]

2005-03-17 Thread Michael Wechner
David Spencer wrote: Michael Wechner wrote: Stefan Groschupf wrote: have you collected these offers somewhere? Check the source-forge mail archive. thanks, will do. btw, is there an interface within Nutch, where a CMS (e.g. Apache Lenya) can notify Nutch about content changes (or deletion

[Nutch-dev] Starting a non-profit organisation running Nutch with a thousand or more sponsored servers

2005-03-16 Thread Michael Wechner
to challenge for instance Google. 1000 servers cost quite a lot of money, but one server is affordable for all kinds of people and companies. I am aware that servers are not the only thing, but I would be interested in what the community thinks about such an infrastructure project. Thanks Michi -- Michael