--
Michael Wechner
Wyona - Open Source Content Management -Apache Lenya
http://www.wyona.com
Michael Wechner wrote:
d e wrote:
I'm sorry! I guess I was REALLY not clear. I mean my problem is to drop the
junk *on each page*. I am indexing news sites. I want to harvest news
STORIES, not the advertisements and other junk text around the outside of
each page. Got suggestions
than the HTML itself
<?xml version="1.0"?>
<semantic-of href="index.html">
...
</semantic-of>
resp. some RDF or whatever.
Any pointers are very welcome.
Thanks
Michi
--
Michael Wechner
Wyona - Open Source Content Management -Apache Lenya
http://www.wyona.com http
Krebs, Urs wrote:
Hi list,
I made a tool that can modify and create .owl files as synonym lists.
I don't know where to put it. May I add it to JIRA
I think JIRA would be good for a start
Cheers
Michael
or could you use it
directly in nutch?
Urs
--
Michael Wechner
Wyona - Open
Reporter: Michael Wechner
Fixes parsing of XHTML (e.g. title)
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http
[ http://issues.apache.org/jira/browse/NUTCH-418?page=all ]
Michael Wechner updated NUTCH-418:
--
Attachment: parse-xhtml-patch.txt
patch which fixes the mime-type
Fixes parsing of XHTML (e.g. title
Michael Wechner wrote:
Sami Siren wrote:
Michael Wechner wrote:
Hi
It seems to me that Nutch 0.8.x cannot extract the title from an XHTML
page, e.g.
Try changing the following in your parse-plugins.xml
<mimeType name="application/xhtml+xml">
<plugin id="parse-html"
Michael Wechner wrote:
I have added a patch
https://issues.apache.org/jira/secure/ManageAttachments.jspa?id=12359202
sorry, I actually meant
https://issues.apache.org/jira/browse/NUTCH-418
Cheers
Michi
Thanks
Michi
Cheers
Michi
--
Sami Siren
--
Michael Wechner
: org.apache.nutch.parse.text.TextParser mapped to
contentType application/xhtml+xml via parse-plugins.xml, but its
plugin.xml file does not claim to support contentType: application/xhtml+xml
Can anyone confirm this and, if so, shall I add a bug entry?
Thanks
Michi
--
Michael Wechner
Wyona - Open Source Content Management
Sami Siren wrote:
Michael Wechner wrote:
Hi
It seems to me that Nutch 0.8.x cannot extract the title from an XHTML
page, e.g.
Try changing the following in your parse-plugins.xml
<mimeType name="application/xhtml+xml">
<plugin id="parse-html" />
</mimeType>
, but what do you mean by intranet and internet
crawling?
In the end both of them are just URLs ... right? It seems to me I
completely misunderstand something.
Thanks for a hint
Michi
--
Michael Wechner
Wyona - Open Source Content Management -Apache Lenya
http://www.wyona.com
questions/whatever on this subject is
appreciated as I would like to come up with a more optimal solution for us
intranet nutch users.
Ben
--
Michael Wechner
Wyona - Open Source Content Management -Apache Lenya
http://www.wyona.com http://lenya.apache.org
just commented, hence the minor
difference of the two slashes ;-)
HTH
Michi
Otis
- Original Message
From: Michael Wechner [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Tuesday, August 22, 2006 9:07:12 AM
Subject: Ontology compile bug
Hi
It seems to me that refine-query
Hasan Diwan wrote:
On 25/08/06, Michael Wechner [EMAIL PROTECTED] wrote:
Index: nutch-0.8/src/java/org/apache/nutch/searcher/NutchBean.java
===
--- nutch-0.8/src/java/org/apache/nutch/searcher/NutchBean.java
(Revision
(dir.toString()).exists()) {
+LOG.warn("No such directory: " + new
java.io.File(dir.toString()));
+}
Path servers = new Path(dir, "search-servers.txt");
if (fs.exists(servers)) {
if (LOG.isInfoEnabled()) {
WDYT?
Thanks
Michi
--
Michael Wechner
Wyona
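The directory check proposed in the patch above can be sketched in isolation as follows. This is only a minimal illustration built on java.io.File, not the actual NutchBean code: the class name SearchDirCheckSketch, the helper describeMissingDir, and the default directory name "crawl" are all hypothetical and chosen for this example.

```java
import java.io.File;

// Hypothetical stand-alone sketch; not part of Nutch.
class SearchDirCheckSketch {

    // Returns a warning message if the directory does not exist,
    // or null if it does.
    static String describeMissingDir(File dir) {
        if (!dir.exists()) {
            return "No such directory: " + dir;
        }
        return null;
    }

    public static void main(String[] args) {
        // "crawl" is an assumed default used only for this illustration.
        File dir = new File(args.length > 0 ? args[0] : "crawl");
        String warning = describeMissingDir(dir);
        if (warning != null) {
            // A real application would use its logger here instead of stderr.
            System.err.println("WARN " + warning);
        }
    }
}
```

Reporting the path that was actually looked up is the point of the exercise: it tells a beginner immediately that the search directory is misconfigured, rather than leaving them with an empty result list.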
a misconfigured searcher.dir either.
So, it can be very confusing, especially for beginners, because one
starts scratching and looking what might be the problem and actually
the problem is quite simple.
Enough motivation ;-) ?
HTH
Michi
Stefan
On 25.08.2006 at 06:52, Michael Wechner wrote:
Hi
I think
Sami Siren wrote:
Michael Wechner wrote:
Hi
It seems to me that Nutch does not send an HTTP Accept header. Is that
on purpose?
I would have expected that Nutch tells the server which MIME types it
accepts, i.e. is able to parse and index,
but maybe I misunderstand something
Hi
It seems to me that Nutch does not send an HTTP Accept header. Is that on
purpose?
I would have expected that Nutch tells the server which MIME types it
accepts, i.e. is able to parse and index,
but maybe I misunderstand something.
Thanks
Michi
--
Michael Wechner
Wyona - Open
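To illustrate what sending such a header involves, here is a minimal sketch using only the standard java.net classes. It does not show how Nutch's protocol plugins work; the class name AcceptHeaderSketch, the helper buildAcceptHeader, the list of MIME types, and the example.com URL are all assumptions made for this example. No network request is made, since the connection is never opened.

```java
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone sketch; not part of Nutch.
class AcceptHeaderSketch {

    // Joins the MIME types a client can handle into an Accept header value.
    static String buildAcceptHeader(List<String> mimeTypes) {
        return String.join(", ", mimeTypes);
    }

    public static void main(String[] args) throws Exception {
        // Assumed list of parseable types, for illustration only.
        List<String> parseable =
            Arrays.asList("text/html", "application/xhtml+xml", "text/plain");

        // openConnection() only creates the connection object;
        // nothing is sent until the connection is actually used.
        HttpURLConnection conn = (HttpURLConnection)
            URI.create("http://example.com/").toURL().openConnection();
        conn.setRequestProperty("Accept", buildAcceptHeader(parseable));

        System.out.println("Accept: " + conn.getRequestProperty("Accept"));
    }
}
```

In a crawler the list would presumably be derived from the content types the installed parsers claim to support, so that the header stays in step with the configuration.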
to test the crawl more quickly than first having to
setup the WAR file.
WDYT?
Thanks
Michi
--
Michael Wechner
Wyona - Open Source Content Management -Apache Lenya
http://www.wyona.com http://lenya.apache.org
--
Yukatan - http://yukatan.fi/ - [EMAIL PROTECTED]
Software craftsmanship, JCR consulting, and Java development
--
Michael Wechner
Wyona - Open Source Content Management -Apache Lenya
http://www.wyona.com http://lenya.apache.org
starts crawling the site.
Michi
--
Michael Wechner
Wyona - Open Source Content Management -Apache Lenya
http://www.wyona.com http://lenya.apache.org
+41 44 272 91 61
...
--
Michael Wechner
Wyona - Open Source Content Management -Apache Lenya
http://www.wyona.com http://lenya.apache.org
please excuse me for sending this private message to the mailing list.
Thanks
Michi
Michael Wechner wrote:
Hey Gavin
It's quite some time since we met in San Francisco.
How are you? Hope all is well.
All the best
Michael
Gavin Thomas Nicol wrote:
On Sep 21, 2005, at 11:55 AM, Jack
Doug Cutting wrote:
Should I send a final notice asking folks to join the Apache list, and
then shut down the sourceforge list?
Well, I think people will move quickly to the Apache ML if the
sourceforge one is being shut down ;-)
Michi
--
Michael Wechner
Wyona Inc. - Open Source Content
Doug Cutting wrote:
Michael Wechner wrote:
one needs to start the servlet container within the directory where
the segments directory is located.
Because I often forget this I receive a NullPointerException and then
after some time I remember that I started Tomcat from the wrong
directory.
I
a reason that [EMAIL PROTECTED] still exists?
Thanks
Michi
--
Michael Wechner
Wyona Inc. - Open Source Content Management - Apache Lenya
http://www.wyona.com http://lenya.apache.org
Doug Cutting wrote:
Michael Wechner wrote:
Doug Cutting commented on NUTCH-42:
---
I prefer we have a servlet that generates only XML, and then
generate HTML from this XML. Do you dislike that approach for some
reason?
no, not at all. I much more prefer
started Tomcat from the wrong directory.
I think it would make sense if the segments directory could be specified
within the web.xml, and if a descriptive exception were thrown when the
segments directory cannot be found, telling one what might be wrong.
WDYT?
Thanks
Michi
--
Michael Wechner
[ http://issues.apache.org/jira/browse/NUTCH-42?page=history ]
Michael Wechner updated NUTCH-42:
-
Attachment: search.jsp.diff
Add RSS link to search.jsp pointing to the OpenSearch servlet
enhance search.jsp such that it can also returns XML
search.jsp such that it can also returns XML
Key: NUTCH-42
URL: http://issues.apache.org/jira/browse/NUTCH-42
Project: Nutch
Type: Wish
Components: web gui
Reporter: Michael Wechner
Priority: Trivial
Attachments
Replace CVS by SVN within tutorial of Documentation
---
Key: NUTCH-41
URL: http://issues.apache.org/jira/browse/NUTCH-41
Project: Nutch
Type: Bug
Reporter: Michael Wechner
Priority: Trivial
Attachments
[ http://issues.apache.org/jira/browse/NUTCH-41?page=history ]
Michael Wechner updated NUTCH-41:
-
Attachment: tutorial.xml.diff
Replace CVS by SVN within tutorial of Documentation
---
Key
enhance search.jsp such that it can also returns XML
Key: NUTCH-42
URL: http://issues.apache.org/jira/browse/NUTCH-42
Project: Nutch
Type: Wish
Components: web gui
Reporter: Michael Wechner
Priority
to Apache.
WDYT?
Michi
--
Michael Wechner
Wyona Inc. - Open Source Content Management - Apache Lenya
http://www.wyona.com http://lenya.apache.org
--
Michael Wechner
Wyona Inc. - Open Source Content Management - Apache Lenya
http://www.wyona.com http://lenya.apache.org
Doug Cutting wrote:
as svn:ignore parameters
Done.
that was quick :-)
Thanks
Michi
--
Michael Wechner
Wyona Inc. - Open Source Content Management - Apache Lenya
http://www.wyona.com http://lenya.apache.org
this would make crawling obsolete to a certain point (at least for pages
being created by content management systems).
Thanks
Michi
--
Michael Wechner
Wyona Inc. - Open Source Content Management - Apache Lenya
http://www.wyona.com http://lenya.apache.org
:[EMAIL PROTECTED] On Behalf Of Michael
Wechner
Sent: Wednesday, March 16, 2005 6:09 PM
To: [EMAIL PROTECTED]
Subject: [Nutch-dev] Starting a non-profit organisation running Nutch with a
thousand or more sponsored servers
Hi
I was recently thinking that it would be fun to start a non-profit
, because it's a very central
point re the web.
Thanks
Michi
Otis
--- Michael Wechner [EMAIL PROTECTED] wrote:
Hi
I was recently thinking that it would be fun to start a non-profit
organization
in order to run Nutch as a really transparent and open search
engine, very
similar as for instance
___
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
--
Michael Wechner
Wyona Inc. - Open Source Content Management - Apache Lenya
http
Michael Wechner wrote:
could help here to make sure, that the organization would dissolve
sorry, I meant would not
because of money or other issues, but I guess that's another challenge
;-)
Michi
--
Michael Wechner
Wyona Inc. - Open Source Content Management - Apache Lenya
http
David Spencer wrote:
Michael Wechner wrote:
Stefan Groschupf wrote:
have you collected these offers somewhere?
Check the source-forge mail archive.
thanks, will do.
btw, is there an interface within Nutch, where a CMS (e.g. Apache
Lenya) can notify Nutch about content changes (or deletion
to challenge for instance Google. A thousand servers cost quite a lot of
money, but one server is affordable for all kinds of people and companies.
I am aware that servers are not the only thing needed, but I would be
interested in what the community thinks about such an infrastructure project.
Thanks
Michi
--
Michael