...
--
Michael Wechner
Wyona - Open Source Content Management -Apache Lenya
http://www.wyona.com http://lenya.apache.org
[EMAIL PROTECTED][EMAIL PROTECTED]
My apologies for sending this private message to the mailing list.
Thanks
Michi
Michael Wechner wrote:
Hey Gavin
It's been quite some time since we met in San Francisco.
How are you? Hope all is well.
All the best
Michael
Gavin Thomas Nicol wrote:
On Sep 21, 2005, at 11:55 AM, Jack
starts crawling the site.
Michi
--
Michael Wechner
Wyona - Open Source Content Management -Apache Lenya
http://www.wyona.com http://lenya.apache.org
[EMAIL PROTECTED][EMAIL PROTECTED]
+41 44 272 91 61
to test the crawl more quickly than first having to
set up the WAR file.
WDYT?
Thanks
Michi
Michael Wechner wrote:
which would allow testing the crawl more quickly than first having
to set up the WAR file.
WDYT?
whereas I guess this would be similar to
sh bin/nutch org.apache.nutch.searcher.NutchBean WORD
or, respectively,
sh bin/nutch search WORD
which I think would be nicer ;-)
Michi
Hi
It seems to me that Nutch does not send an HTTP Accept header. Is that on
purpose?
I would have expected Nutch to tell the server which MIME types it
accepts, i.e. which ones it is able to parse and index,
but maybe I misunderstand something.
Thanks
Michi
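To make the question above concrete, here is a minimal sketch of what sending an Accept header could look like. It uses only the standard java.net API, not Nutch's own HTTP protocol plugins; the class name AcceptHeaderSketch, the helper methods, and the list of MIME types are assumptions made for illustration and do not come from the Nutch source.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.List;

// Hypothetical illustration, not Nutch code: advertises the MIME types
// a crawler can parse by setting the HTTP Accept request header.
public class AcceptHeaderSketch {

    // Joins the supported MIME types into a single Accept header value.
    public static String buildAcceptHeader(List<String> mimeTypes) {
        return String.join(", ", mimeTypes);
    }

    // Prepares (but does not open) a connection carrying the Accept header.
    public static HttpURLConnection prepare(String url, List<String> mimeTypes)
            throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestProperty("Accept", buildAcceptHeader(mimeTypes));
        return conn;
    }

    public static void main(String[] args) throws IOException {
        // Assumed list of parseable types; a real crawler would derive it
        // from its configured parser plugins.
        List<String> types =
                List.of("text/html", "application/xhtml+xml", "text/plain");
        HttpURLConnection conn = prepare("http://example.com/", types);
        // No network traffic happens here; we only inspect the pending header.
        System.out.println(conn.getRequestProperty("Accept"));
    }
}
```

A server is free to ignore the header, so this only states a preference; it does not guarantee which content type comes back.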
Sami Siren wrote:
Michael Wechner wrote:
Hi
It seems to me that Nutch does not send an HTTP Accept header. Is that
on purpose?
I would have expected Nutch to tell the server which MIME types it
accepts, i.e. which ones it is able to parse and index,
but maybe I misunderstand something
(dir.toString()).exists()) {
+  LOG.warn("No such directory: " + new java.io.File(dir.toString()));
+}
 Path servers = new Path(dir, "search-servers.txt");
 if (fs.exists(servers)) {
   if (LOG.isInfoEnabled()) {
WDYT?
Thanks
Michi
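The idea behind the patch above, warning when the search directory is missing before looking for search-servers.txt, can be sketched in isolation as follows. This is a standalone illustration using java.nio.file and java.util.logging rather than the Hadoop FileSystem and Path classes the patch touches; the class SearchDirCheck and its methods are hypothetical names, not part of NutchBean.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.logging.Logger;

// Hypothetical illustration, not the NutchBean code: warn early when the
// configured search directory does not exist.
public class SearchDirCheck {
    private static final Logger LOG =
            Logger.getLogger(SearchDirCheck.class.getName());

    // Returns true if dir is an existing directory; logs a warning otherwise.
    public static boolean warnIfMissing(Path dir) {
        if (!Files.isDirectory(dir)) {
            LOG.warning("No such directory: " + dir.toAbsolutePath());
            return false;
        }
        return true;
    }

    // True only when dir exists and contains search-servers.txt.
    public static boolean hasServersFile(Path dir) {
        return warnIfMissing(dir)
                && Files.exists(dir.resolve("search-servers.txt"));
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("crawl");
        // Existing but empty directory: no warning, no servers file.
        System.out.println(hasServersFile(dir));
        // Missing directory: a warning is logged before returning false.
        System.out.println(hasServersFile(dir.resolve("missing")));
    }
}
```

Logging the absolute path makes a mistyped or relative directory argument easier to spot than a silent empty result.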
Hasan Diwan wrote:
On 25/08/06, Michael Wechner [EMAIL PROTECTED] wrote:
Index: nutch-0.8/src/java/org/apache/nutch/searcher/NutchBean.java
===
--- nutch-0.8/src/java/org/apache/nutch/searcher/NutchBean.java
(Revision 436787
just commented, hence the minor
difference of the two slashes ;-)
HTH
Michi
Otis
- Original Message
From: Michael Wechner [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Tuesday, August 22, 2006 9:07:12 AM
Subject: Ontology compile bug
Hi
It seems to me that refine-query
/additional questions/whatever on this subject is
appreciated, as I would like to come up with a better solution for us
intranet Nutch users.
Ben
: org.apache.nutch.parse.text.TextParser mapped to
contentType application/xhtml+xml via parse-plugins.xml, but its
plugin.xml file does not claim to support contentType: application/xhtml+xml
Can anyone confirm this, or shall I add a bug entry?
Thanks
Michi
, but what do you mean by intranet and internet
crawling?
In the end both of them are just URLs ... right? It seems to me I
completely misunderstand something.
Thanks for a hint
Michi
Michael Wechner wrote:
Sami Siren wrote:
Michael Wechner wrote:
Hi
It seems to me that Nutch 0.8.x cannot extract the title from an XHTML
page, e.g.
Try changing the following in your parse-plugins.xml
<mimeType name="application/xhtml+xml">
<plugin id="parse-html" />
Michael Wechner wrote:
I have added a patch
https://issues.apache.org/jira/secure/ManageAttachments.jspa?id=12359202
sorry, I actually meant
https://issues.apache.org/jira/browse/NUTCH-418
Cheers
Michi
Thanks
Michi
Cheers
Michi
--
Sami Siren
--
Michael Wechner
Krebs, Urs wrote:
Hi list,
I made a tool that can modify and create .owl files as synonym lists.
I don't know where to put it. May I add it to JIRA
I think JIRA would be good for a start
Cheers
Michael
or could you use it
directly in nutch?
Urs
a table of contents
Cheers
Michi
Dennis Kubes
than the HTML itself
<?xml version="1.0"?>
<semantic-of href="index.html">
...
</semantic-of>
resp. some RDF or whatever.
Any pointers are very welcome.
Thanks
Michi
Michael Wechner wrote:
d e wrote:
I'm sorry! I guess I was REALLY not clear. I mean my problem is to
drop the
junk *on each page*. I am indexing news sites. I want to harvest news
STORIES, not the advertisements and other junk text around the
outside of
each page. Got suggestions
Reporter: Michael Wechner
Fixes parsing of XHTML (e.g. title)
[ http://issues.apache.org/jira/browse/NUTCH-418?page=all ]
Michael Wechner updated NUTCH-418:
--
Attachment: parse-xhtml-patch.txt
patch which fixes the mime-type
Fixes parsing of XHTML (e.g. title