I just wrote a Java port of Arc90 Labs' 'readability' JavaScript code, which has a very strong reputation for extracting the useful content from news-style web pages.
I could contribute it to Tika if (a) you want it, and (b) there is some reasonable way to decide or configure which one to use.
