Honestly, I haven't modified any crawler code at all. Are you sure you
entered a URL with a trailing slash in the seed list? I tried to omit
that slash, and then the crawler began to act strangely. I cannot
reproduce your results.
These are my settings:
Seed: http://ridder.uio.no/
Inclusions: ^http://ridder.uio.no/.* (marked "include only host matching
...")
Everything works like a dream. The only problem I have is with the PDF
document: the Norwegian characters are not parsed correctly. This may be
a Tika bug, since all other document formats are parsed correctly.
BTW: I did an svn update, ant clean -> build, and now the document with
the noindex rule is skipped. Great. Thanks a zillion!
And regarding the Solr trick with the jar files I had to move manually
since they were excluded from solr.jar (my last homework): when Solr is
running in a servlet container such as Resin, you have to move the
following jars manually into the <solr.home>/lib directory in order to
enable the ExtractingRequestHandler:
- apache-solr-cell-*.jar
- the other Tika jars
You will find the same information in the following file:
solr_trunk/solr/contrib/extraction/CHANGES.txt.
Erlend
On 02.02.11 17.34, Karl Wright wrote:
Turns out Java doesn't like the form of those URLs; it doesn't think they're proper:
WEB: Can't use url 'dokument.pdf' because it is badly formed: Relative path in absolute URI: http://ridder.uio.nodokument.pdf
WEB: In html document 'http://ridder.uio.no', found an unincluded URL 'dokument.pdf'
This is the relevant code, which uses the java.net.URI class:
java.net.URI parentURL = new java.net.URI(parentIdentifier);
url = parentURL.resolve(rawURL);
... and this is throwing a java.net.URISyntaxException.
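Here is a minimal standalone sketch (not the connector's actual code) of
how java.net.URI produces exactly that complaint:

  import java.net.URI;
  import java.net.URISyntaxException;

  public class ResolveDemo {
      public static void main(String[] args) throws URISyntaxException {
          // Base with no trailing slash, i.e. an empty path component.
          URI base = new URI("http://ridder.uio.no");
          // RFC 2396 merging concatenates the empty base path with the
          // relative reference, so no "/" is inserted between them:
          URI merged = base.resolve("dokument.pdf");
          System.out.println(merged); // http://ridder.uio.nodokument.pdf
          // Rebuilding that URI through the validating constructor fails,
          // because an absolute URI with an authority cannot carry a
          // relative path:
          new URI(merged.getScheme(), merged.getAuthority(), merged.getPath(),
                  merged.getQuery(), merged.getFragment());
          // -> java.net.URISyntaxException: Relative path in absolute URI
      }
  }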
I'm going to have to go look at the standards to figure out what we
should do here. Perhaps the right approach is to note the exception
and retry with a "/" glommed onto the front of the relative URL if we
get it.
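Purely as a sketch of that idea (resolveWithFallback is a made-up name,
and it only covers the relative-reference case):

  import java.net.URI;
  import java.net.URISyntaxException;

  public class UrlResolver {
      // Hypothetical fallback, not committed code.
      static URI resolveWithFallback(URI parent, String rawURL) {
          URI candidate = parent.resolve(rawURL);
          try {
              // Round-trip through the validating constructor; this is
              // where "Relative path in absolute URI" shows up.
              return new URI(candidate.getScheme(), candidate.getAuthority(),
                             candidate.getPath(), candidate.getQuery(),
                             candidate.getFragment());
          } catch (URISyntaxException e) {
              // Retry with "/" glommed onto the front of the relative URL;
              // equivalent to treating the parent's empty path as "/".
              return parent.resolve("/" + rawURL);
          }
      }
  }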
But clearly you must have modified the web connector in order to get
it to crawl your stuff in the first place.
Karl
On Wed, Feb 2, 2011 at 11:08 AM, Karl Wright <daddy...@gmail.com> wrote:
Hmm. I get 701 bytes from your seed, but no parseable links. Investigating...
Karl
On Wed, Feb 2, 2011 at 10:45 AM, Erlend Garåsen <e.f.gara...@usit.uio.no> wrote:
On 28.01.11 14.32, Karl Wright wrote:
Thanks. I tested my changes enough that I was confident in committing
the patch, so the changes are in trunk.
I'm afraid that it doesn't work properly. I downloaded the latest version
from trunk and started the crawler.
Try to use the following address in your seed list and the following rule in
the includes list:
^http://ridder.uio.no/.*
The following document was fetched and sent to Solr for indexing even though
it includes a robots noindex rule:
http://ridder.uio.no/test_closed/
Here's the line from the history telling me that Solr indexed it:
02-02-2011 16:12:33.283 document ingest (Solr) http://ridder.uio.no/test_closed/ 200
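To be explicit about what the test page contains: the rule is an
ordinary robots meta tag in the page head. A crude standalone
illustration (not the connector's actual parsing) of the kind of check
that should have stopped the ingestion:

  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  public class NoIndexCheck {
      // Crude illustration only; the connector has its own HTML handling.
      private static final Pattern META_ROBOTS = Pattern.compile(
          "<meta\\s+name=[\"']robots[\"']\\s+content=[\"']([^\"']*)[\"']",
          Pattern.CASE_INSENSITIVE);

      static boolean isNoIndex(String html) {
          Matcher m = META_ROBOTS.matcher(html);
          return m.find() && m.group(1).toLowerCase().contains("noindex");
      }

      public static void main(String[] args) {
          String page = "<html><head>"
              + "<meta name=\"robots\" content=\"noindex,nofollow\">"
              + "</head></html>";
          System.out.println(isNoIndex(page)); // true -> should not be ingested
      }
  }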
I can try to modify the code you have added in order to get around this
tomorrow. I guess I can find the relevant check somewhere in the following
folder?
mcf-trunk/connectors/webcrawler/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/webcrawler
Erlend
--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050