Hi Claudio,
On Jan 25, 2010, at 8:41am, Claudio Martella wrote:
Hello list,
I'm writing a web crawler and using Apache Tika to extract text from
certain MIME types.
What I'm doing right now is:
while (!toVisit.isEmpty()) {
    [... SOME CODE HANDLING LISTS AND HTTPCLIENT ...]
    client.executeMethod(method);
    // e.g. returns "text/html" or "application/pdf"
    String mime = getContentType(method);
    String htmlBody = null;
    // extract text where possible and send it to the index
    if (supportedContentType.contains(mime)) {
        pool.execute(new MyContentHandler(new CrawledResult(
                method.getResponseBodyAsStream(), workingURL, null, mime)));
    } else {
        htmlBody = method.getResponseBodyAsString();
    }
    [... SOME CODE TO EXTRACT LINKS OUT OF htmlBody AND
         POPULATE toVisit SET ...]
    method.releaseConnection();
}
My problem is that before my threads can access the content of the
stream, the crawler calls releaseConnection(), which leads to an
IOException.
I'd like to avoid passing the whole HttpMethod and delegating its
management to my other classes, which should only know about I/O
streams.
Do you have any idea how I could handle this concurrency problem?
Maybe some cloning?
If you really want to avoid passing explicit HttpClient classes to
your thread, then the best option I can think of is to define an
IHttpFetcher interface that masks the underlying implementation, and
implement this interface with a specific HttpClient version.
This is similar to what I did for Bixo. See
http://github.com/bixo/bixo/blob/master/src/main/java/bixo/fetcher/http/SimpleHttpFetcher.java
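To make the interface idea concrete, here is a minimal sketch. It is not Bixo's actual API: only the name IHttpFetcher comes from the suggestion above, while FetchedResult, Streams, StubFetcher and FetcherDemo are hypothetical names made up for illustration. The sketch also assumes the response body is small enough to buffer in memory, which is what lets the fetcher release the connection before a worker thread reads the content. StubFetcher stands in for a real HttpClient-backed implementation so that the example runs without a network.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Demo: the crawler thread fetches, a pool thread reads the content later.
final class FetcherDemo {
    public static void main(String[] args) throws Exception {
        IHttpFetcher fetcher = new StubFetcher();
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            // By the time fetch() returns, the connection is already released.
            FetchedResult result = fetcher.fetch("http://example.com/");
            Callable<String> task = () -> new String(
                    Streams.readFully(result.openStream()), StandardCharsets.UTF_8);
            Future<String> text = pool.submit(task);
            System.out.println(text.get());
        } finally {
            pool.shutdown();
        }
    }
}

// Hypothetical interface: callers see URLs and streams, never HttpClient types.
interface IHttpFetcher {
    FetchedResult fetch(String url) throws IOException;
}

// Hypothetical result type: owns a private copy of the body, so it stays
// readable after the underlying connection is gone.
final class FetchedResult {
    private final String url;
    private final String contentType;
    private final byte[] body;

    FetchedResult(String url, String contentType, byte[] body) {
        this.url = url;
        this.contentType = contentType;
        this.body = body.clone();
    }

    String getUrl() { return url; }
    String getContentType() { return contentType; }

    // Every call returns an independent stream over the buffered bytes.
    InputStream openStream() { return new ByteArrayInputStream(body); }
}

// Small helper that drains a stream into a byte array.
final class Streams {
    private Streams() {}

    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }
}

// Stand-in implementation. A real one would execute the request with
// HttpClient here and release the connection in a finally block, after
// the body has been buffered.
final class StubFetcher implements IHttpFetcher {
    @Override
    public FetchedResult fetch(String url) throws IOException {
        byte[] page = ("<html>body of " + url + "</html>")
                .getBytes(StandardCharsets.UTF_8);
        try (InputStream live = new ByteArrayInputStream(page)) {
            return new FetchedResult(url, "text/html", Streams.readFully(live));
        }
    }
}
```

With this shape, MyContentHandler would take a FetchedResult instead of a live stream, and the crawler loop can release the connection as soon as fetch() returns. The trade-off is memory: for large documents you would probably want to cap the number of bytes buffered per response.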
-- Ken
--------------------------------------------
<http://ken-blog.krugler.org>
+1 530-265-2225
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g