Re: hyperbolic browser api (I missed)

2005-09-22 Thread Michael Wechner
... -- Michael Wechner Wyona - Open Source Content Management -Apache Lenya http://www.wyona.com http://lenya.apache.org [EMAIL PROTECTED][EMAIL PROTECTED]

Re: hyperbolic browser api (I missed)

2005-09-22 Thread Michael Wechner
I apologize for sending this private message to the mailing list. Thanks Michi Michael Wechner wrote: Hey Gavin, it's been quite some time since we met in San Francisco. How are you? Hope all is well. All the best Michael Gavin Thomas Nicol wrote: On Sep 21, 2005, at 11:55 AM, Jack

Re: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?

2006-06-15 Thread Michael Wechner
starts crawling the site. Michi -- Michael Wechner Wyona - Open Source Content Management -Apache Lenya http://www.wyona.com http://lenya.apache.org [EMAIL PROTECTED][EMAIL PROTECTED] +41 44 272 91 61

Allowing search from command line

2006-08-14 Thread Michael Wechner
to test the crawl more quickly than first having to set up the WAR file. WDYT? Thanks Michi

Re: Allowing search from command line

2006-08-14 Thread Michael Wechner
Michael Wechner wrote: which would allow testing the crawl more quickly than first having to set up the WAR file. WDYT? I guess this would be similar to sh bin/nutch org.apache.nutch.searcher.NutchBean WORD or, alternatively, sh bin/nutch search WORD, which I think would be nicer ;-) Michi

HTTP Accept Header seems to be missing

2006-08-16 Thread Michael Wechner
Hi It seems to me that Nutch does not send an HTTP Accept header. Is that on purpose? I would have expected Nutch to tell the server which MIME types it accepts, i.e. which ones it is able to parse and index, but maybe I misunderstand something. Thanks Michi

Re: HTTP Accept Header seems to be missing

2006-08-17 Thread Michael Wechner
Sami Siren wrote: Michael Wechner wrote: Hi It seems to me that Nutch does not send an HTTP Accept header. Is that on purpose? I would have expected Nutch to tell the server which MIME types it accepts, i.e. which ones it is able to parse and index, but maybe I misunderstand something
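
For readers unfamiliar with the header discussed in this thread, here is a minimal sketch of how a crawler could advertise the MIME types it is able to parse. The class AcceptHeaderSketch, its methods, and the low-priority wildcard entry are assumptions made for illustration; none of this is taken from Nutch, and the thread does not say how (or whether) Nutch should build the value.

```java
import java.net.URLConnection;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Nutch: builds an HTTP Accept header
// value from the MIME types a crawler is able to parse and index.
final class AcceptHeaderSketch {

    // Joins the MIME types with ", " and appends a low-priority wildcard
    // so that servers may still answer with other content types.
    static String build(List<String> parseableTypes) {
        StringBuilder value = new StringBuilder();
        for (String type : parseableTypes) {
            if (value.length() > 0) {
                value.append(", ");
            }
            value.append(type);
        }
        if (value.length() > 0) {
            value.append(", ");
        }
        value.append("*/*;q=0.1");
        return value.toString();
    }

    // Sets the header on a connection; must be called before connecting.
    static void apply(URLConnection connection, List<String> parseableTypes) {
        connection.setRequestProperty("Accept", build(parseableTypes));
    }

    public static void main(String[] args) {
        List<String> types = Arrays.asList("text/html", "application/xhtml+xml");
        System.out.println("Accept: " + build(types));
    }
}
```

In a real crawler the list of types would presumably come from whatever parsers are configured, which is the point raised in the original message.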

Checking if crawl dir exists ...

2006-08-25 Thread Michael Wechner
(dir.toString()).exists()) { +LOG.warn("No such directory: " + new java.io.File(dir.toString())); +} Path servers = new Path(dir, "search-servers.txt"); if (fs.exists(servers)) { if (LOG.isInfoEnabled()) { WDYT? Thanks Michi
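
The idea behind the proposed patch, warning when the crawl directory is missing, can be sketched in isolation roughly as follows. CrawlDirCheck is a hypothetical class, not Nutch code; it uses java.util.logging instead of the logger in NutchBean, and it tests isDirectory() rather than exists(), which is slightly stricter than the check in the patch.

```java
import java.io.File;
import java.util.logging.Logger;

// Illustrative only: a hypothetical stand-alone version of the check,
// not the code from NutchBean.java.
final class CrawlDirCheck {

    private static final Logger LOG =
            Logger.getLogger(CrawlDirCheck.class.getName());

    // Returns true if dir names an existing directory; otherwise logs a
    // warning that mirrors the message in the patch and returns false.
    static boolean exists(String dir) {
        File crawlDir = new File(dir);
        if (!crawlDir.isDirectory()) {
            LOG.warning("No such directory: " + crawlDir);
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Defaults to a directory named "crawl" (an assumed name).
        String dir = args.length > 0 ? args[0] : "crawl";
        System.out.println(dir + " exists: " + exists(dir));
    }
}
```

Returning a boolean rather than only logging is a design choice of this sketch; it lets the caller decide whether to continue without an index.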

Re: [Nutch-dev] Checking if crawl dir exists ...

2006-08-26 Thread Michael Wechner
Hasan Diwan wrote: On 25/08/06, Michael Wechner [EMAIL PROTECTED] wrote: Index: nutch-0.8/src/java/org/apache/nutch/searcher/NutchBean.java === --- nutch-0.8/src/java/org/apache/nutch/searcher/NutchBean.java (Revision 436787

Re: Ontology compile bug

2006-09-08 Thread Michael Wechner
just commented, hence the minor difference of the two slashes ;-) HTH Michi Otis - Original Message From: Michael Wechner [EMAIL PROTECTED] To: nutch-dev@lucene.apache.org Sent: Tuesday, August 22, 2006 9:07:12 AM Subject: Ontology compile bug Hi It seems to me that refine-query

Re: File system watching for intranets

2006-09-13 Thread Michael Wechner
/additional questions/whatever on this subject is appreciated as I would like to come up with a more optimal solution for us intranet nutch users. Ben

Extracting title from XHTML pages

2006-12-20 Thread Michael Wechner
: org.apache.nutch.parse.text.TextParser mapped to contentType application/xhtml+xml via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: application/xhtml+xml Can anyone confirm this, or shall I add a bug entry? Thanks Michi

difference between intranet and internet crawling

2006-12-20 Thread Michael Wechner
, but what do you mean by intranet and internet crawling? In the end both of them are just URLs ... right? It seems to me I completely misunderstand something. Thanks for a hint Michi

Re: Extracting title from XHTML pages

2006-12-21 Thread Michael Wechner
Michael Wechner wrote: Sami Siren wrote: Michael Wechner wrote: Hi It seems to me that Nutch 0.8.x cannot extract the title from an XHTML page, e.g. Try changing the following in your parse-plugins.xml: <mimeType name="application/xhtml+xml"> <plugin id="parse-html"

Re: Extracting title from XHTML pages

2006-12-21 Thread Michael Wechner
Michael Wechner wrote: I have added a patch https://issues.apache.org/jira/secure/ManageAttachments.jspa?id=12359202 Sorry, I actually meant https://issues.apache.org/jira/browse/NUTCH-418 Cheers Michi

Re: SynonymEditor

2007-01-17 Thread Michael Wechner
Krebs, Urs wrote: Hi list, I made a tool that can modify and create .owl files as synonym lists. I don't know where to put it. May I add it to JIRA I think JIRA would be good for a start Cheers Michael or could you use it directly in Nutch? Urs

Re: Finished How to Become a Nutch Developer

2007-01-23 Thread Michael Wechner
a table of contents Cheers Michi Dennis Kubes

Getting a semantic version of an HTML page

2007-02-06 Thread Michael Wechner
than the HTML itself <?xml version="1.0"?> <semantic-of href="index.html"> ... </semantic-of> or some RDF or whatever. Any pointers are very welcome. Thanks Michi

Re: Indexing the Interesting Part Only...

2007-03-11 Thread Michael Wechner
...

Re: Indexing the Interesting Part Only...

2007-03-11 Thread Michael Wechner
Michael Wechner wrote: d e wrote: I'm sorry! I guess I was REALLY not clear. I mean my problem is to drop the junk *on each page*. I am indexing news sites. I want to harvest news STORIES, not the advertisements and other junk text around the outside of each page. Got suggestions

[jira] Created: (NUTCH-418) Fixes parsing of XHTML (e.g. title)

2006-12-21 Thread Michael Wechner (JIRA)
Reporter: Michael Wechner Fixes parsing of XHTML (e.g. title) -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http

[jira] Updated: (NUTCH-418) Fixes parsing of XHTML (e.g. title)

2006-12-21 Thread Michael Wechner (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-418?page=all ] Michael Wechner updated NUTCH-418: -- Attachment: parse-xhtml-patch.txt patch which fixes the mime-type Fixes parsing of XHTML (e.g. title