Hi Claudio,

On Jan 25, 2010, at 8:41am, Claudio Martella wrote:

Hello list,

I'm writing a web crawler and using Apache Tika to extract text from
certain MIME types.

What I'm currently doing is:

   while (!toVisit.isEmpty()) {

               [... SOME CODE HANDLING LISTS AND HTTPCLIENT ...]

               client.executeMethod(method);
               // e.g. returns "text/html" or "application/pdf"
               String mime = getContentType(method);

               // declared here so the link-extraction code below can see it
               String htmlBody = null;

               // extract text where possible and send it to the index
               if (supportedContentType.contains(mime)) {
                   pool.execute(new MyContentHandler(new CrawledResult(
                           method.getResponseBodyAsStream(), workingURL, null, mime)));
               } else {
                   htmlBody = method.getResponseBodyAsString();
               }

               [... SOME CODE TO EXTRACT LINKS OUT OF htmlBody AND POPULATE toVisit SET ...]

               // runs while the pooled handler may still be reading the stream
               method.releaseConnection();
   }

My problem is that the crawler calls releaseConnection() before my
threads have finished reading the stream, which leads to an IOException.

I'd like to avoid passing the whole HttpMethod around and delegating its
management to my other classes, which should only know about I/O streams.
Do you have any idea how I could handle this concurrency problem?
Maybe some cloning?

If you really want to avoid passing explicit HttpClient classes to your thread, then the best option I can think of is to define an IHttpFetcher interface that masks the underlying implementation, and implement this interface with a specific HttpClient version.

This is similar to what I did for Bixo. See 
http://github.com/bixo/bixo/blob/master/src/main/java/bixo/fetcher/http/SimpleHttpFetcher.java
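
To make the interface idea more concrete, here is a minimal sketch. Only the name IHttpFetcher comes from the suggestion above; the fetch() signature, the FetchedContent class and the StreamBuffers helper are hypothetical names made up for illustration, and this is not necessarily how Bixo does it. The assumption is that the fetcher drains the response into memory before releasing the connection, so the worker thread only ever sees a plain InputStream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical result type: the body is fully buffered, so it outlives the connection. */
final class FetchedContent {
    private final String url;
    private final String mimeType;
    private final byte[] body;

    FetchedContent(String url, String mimeType, byte[] body) {
        this.url = url;
        this.mimeType = mimeType;
        this.body = body;
    }

    String getUrl() { return url; }
    String getMimeType() { return mimeType; }

    /** Each call returns a fresh, independent stream over the buffered bytes. */
    InputStream openStream() { return new ByteArrayInputStream(body); }
}

/** Hides the HTTP library from the classes that consume the content. */
interface IHttpFetcher {
    FetchedContent fetch(String url) throws IOException;
}

/** Hypothetical helper for implementations of IHttpFetcher. */
final class StreamBuffers {
    private StreamBuffers() {}

    /** Copies everything from the stream into a byte array. */
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }
}

// An HttpClient-backed implementation of fetch() would then look roughly like:
//
//     client.executeMethod(method);
//     try {
//         byte[] body = StreamBuffers.readFully(method.getResponseBodyAsStream());
//         return new FetchedContent(url, getContentType(method), body);
//     } finally {
//         method.releaseConnection();   // safe: the body has already been copied
//     }
```

The trade-off is that the whole response sits in memory, so a real crawler would probably want a cap on the number of bytes it is willing to buffer.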

-- Ken

--------------------------------------------
<http://ken-blog.krugler.org>
+1 530-265-2225






--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g




