Hi Andy and St.Ack,
I'd be interested to hear if logging turns up anything.
Table commits have sub-second response times. It looks like crawling is causing the slowness.
Inside the map task, definitely. A job failure at the map stage would force you to redo anything that might still be in the collector.
I am putting data into the same row and column family as I am scanning. According to St.Ack's response, I need to put the data in a separate column family; I will see if this helps. I'm curious: does the commit write the data to the same region the map task is scanning? Is that what may be causing the contention?
In general, crawling can take a long time, especially if you are recursively following links (are you?). Remote servers are often quite slow. I set a socket timeout and a connection timeout for commons-httpclient, and retry on failure.
Luckily, I do not need to recurse into links. Can you share what settings you are using for commons-httpclient?
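For reference, socket/connection timeouts and a retry handler in commons-httpclient 3.x are typically configured along these lines. The timeout values and URL below are placeholders for illustration, not Andy's actual settings:

```java
import org.apache.commons.httpclient.DefaultHttpMethodRetryHandler;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.params.HttpMethodParams;

public class CrawlerClient {
  public static void main(String[] args) throws Exception {
    HttpClient client = new HttpClient();
    // Fail fast on dead or slow servers instead of hanging a map task.
    // Both values are in milliseconds and are illustrative only.
    client.getHttpConnectionManager().getParams().setConnectionTimeout(5000);
    client.getHttpConnectionManager().getParams().setSoTimeout(10000);

    GetMethod get = new GetMethod("http://example.com/"); // placeholder URL
    // Retry transient failures up to 3 times; do not retry if the
    // request was already sent (second argument false).
    get.getParams().setParameter(HttpMethodParams.RETRY_HANDLER,
        new DefaultHttpMethodRetryHandler(3, false));
    try {
      int status = client.executeMethod(get);
      System.out.println("HTTP status: " + status);
    } finally {
      get.releaseConnection();
    }
  }
}
```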
Another possibility is to run a MR job ahead of time to build a worklist in DFS and avoid use of TableMap entirely. This would also allow you to split the work into more maps.
I may end up doing this if I find that the number of MR tasks is not sufficient.
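As a rough sketch of the worklist idea, the pre-pass job would just scan the table and write one URL per line into a DFS file, which a later MR job can then read with as many maps as desired. The path and URLs below are hypothetical; the scan that would actually feed the writer is elided:

```java
import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: materialize a crawl worklist in DFS so the crawl job can
// read a plain file (split across many maps) instead of using TableMap.
public class WorklistWriter {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path worklist = new Path("/tmp/crawl-worklist.txt"); // hypothetical path
    BufferedWriter out = new BufferedWriter(
        new OutputStreamWriter(fs.create(worklist, true)));
    try {
      // In practice these lines would come from scanning the HBase table.
      out.write("http://example.com/page1\n");
      out.write("http://example.com/page2\n");
    } finally {
      out.close();
    }
  }
}
```

The number of maps then follows from the input splits of the worklist file rather than from the table's region count.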
Is there a way to split the regions before the MR task runs? I know it is going to write ~2K per row; is there a way to tell HBase to go ahead and split based on this anticipated size?
thanks, Dru
