There are two separate issues here - HTML parsing (sometimes called
cleanup) vs. getting rid of boilerplate content, which is also often
called HTML cleanup.
TagSoup & NekoHTML are examples of the former - code that "fixes up"
HTML documents so you can apply standard XML parsing techniques.
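E.g. with TagSoup, you can run the fixed-up SAX stream through an
identity transform and get well-formed XML out the other side. A
minimal sketch (assuming TagSoup's jar is on the classpath; class and
method names are just for illustration):

import java.io.StringReader;
import java.io.StringWriter;

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;

import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class TagSoupCleanup {

    // Runs messy HTML through TagSoup and an identity transform,
    // yielding well-formed XML for standard tooling downstream.
    public static String toWellFormedXml(String messyHtml) throws Exception {
        XMLReader reader = new Parser();  // TagSoup's SAX parser
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        StringWriter out = new StringWriter();
        identity.transform(
            new SAXSource(reader, new InputSource(new StringReader(messyHtml))),
            new StreamResult(out));
        return out.toString();
    }
}

From there you can apply whatever XML tooling you like.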
The articles originally referenced below, as well as my prior note
about nCleaner, are talking about the latter - trying to get rid of
headers, footers, ads, etc.
-- Ken
On Nov 28, 2009, at 12:30pm, Marc Hofer wrote:
Hi Drew,
Currently we are using an HTML filter module from the University of
Duisburg-Essen, which can be found here: http://www.is.informatik.uni-duisburg.de/projects/java-unidu/filter.html
Another idea was to try Jericho or NekoHTML.
http://www.java2s.com/Product/Java/Development/HTML-Parser.htm
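For Jericho, text extraction would look roughly like this (a sketch
assuming Jericho 3.x, untested on our side):

import net.htmlparser.jericho.Source;

public class JerichoExtract {

    // Parses the raw HTML and returns the visible text, tags dropped.
    public static String extractText(String html) {
        Source source = new Source(html);
        return source.getTextExtractor().toString();
    }
}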
Thanks for your advice. We will test it and let you know whether it
works well.
Marc
Drew Farris wrote:
Hi Marc,
How are you planning on cleaning up the HTML documents?
Perhaps something like this would be useful: I came across an
interesting approach a few days ago, and it would be great to hear
from someone who has tried it:
http://ai-depot.com/articles/the-easy-way-to-extract-useful-text-from-arbitrary-html/
Described further, with java implementations here:
http://sujitpal.blogspot.com/2009/11/extracting-useful-text-from-html.html
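As I understand it, the core feature both write-ups build on is a
per-line text-to-markup density; the real approach trains a classifier
on top of that, but a crude sketch of the density idea alone might
look like this (hypothetical class, threshold picked arbitrarily):

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class TextDensityFilter {

    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Keeps lines whose ratio of plain text to total characters is
    // above the threshold; tag-heavy lines (menus, ads) score low.
    public static List<String> denseLines(String html, double threshold) {
        List<String> kept = new ArrayList<String>();
        for (String raw : html.split("\n")) {
            String line = raw.trim();
            if (line.length() == 0) {
                continue;
            }
            String text = TAG.matcher(line).replaceAll("").trim();
            double density = (double) text.length() / line.length();
            if (text.length() > 0 && density >= threshold) {
                kept.add(text);
            }
        }
        return kept;
    }
}

Lines that are mostly tags (navigation, ads) score low; paragraphs of
actual content score high.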
Drew
On Sat, Nov 28, 2009 at 2:57 PM, Marc Hofer <m...@marc-hofer.de> wrote:
Hello everybody,
Having already presented the draft of our architecture, I would now
like to discuss the second layer in more detail. As mentioned before,
we have chosen UIMA for this layer. The main aggregate currently
consists of the Whitespace Tokenizer Annotator, the Snowball Annotator
(stemming) and a list-based StopwordFilter. Before running this
aggregate in a map-only Hadoop job, we want to strip all HTML tags and
forward only the preprocessed text to the aggregate. The reason for
this is that it is difficult to modify a document during processing in
UIMA, and it is impractical to keep working on documents that still
contain HTML tags.
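As a sketch (not our actual code), the strip step as a map-only job
might look like this, assuming Hadoop's org.apache.hadoop.mapreduce
API, an InputFormat that delivers one document per record, and a
placeholder stripTags() helper:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only job: each input record is assumed to be one HTML document;
// the output is the tag-free text the UIMA aggregate consumes.
public class HtmlStripMapper
        extends Mapper<LongWritable, Text, NullWritable, Text> {

    private final Text cleaned = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        cleaned.set(stripTags(value.toString()));
        context.write(NullWritable.get(), cleaned);
    }

    // Placeholder for whichever filter we settle on (the Duisburg-Essen
    // module, Jericho, NekoHTML, ...); a bare regex is not a real parser.
    private String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ");
    }
}

Calling job.setNumReduceTasks(0) in the driver makes the job map-only,
so the mapper output is written out directly.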
Furthermore, we are planning to add the Tagger Annotator, which
implements a Hidden Markov Model tagger. Here we aren't sure which
tokens we should keep or discard, based on their part-of-speech tags,
when using them for feature extraction. One approach could be to start
with only nouns and verbs.
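To illustrate the last point: assuming the tagger emits
Penn-Treebank-style tags (an assumption on our side), keeping only
nouns and verbs could be as simple as:

import java.util.ArrayList;
import java.util.List;

public class PosFilter {

    // Penn Treebank convention: noun tags start with "NN",
    // verb tags with "VB".
    public static boolean keep(String posTag) {
        return posTag.startsWith("NN") || posTag.startsWith("VB");
    }

    // tokens and tags are parallel lists, e.g. the token/tag pairs
    // produced by the tagger.
    public static List<String> nounsAndVerbs(List<String> tokens,
                                             List<String> tags) {
        List<String> kept = new ArrayList<String>();
        for (int i = 0; i < tokens.size(); i++) {
            if (keep(tags.get(i))) {
                kept.add(tokens.get(i));
            }
        }
        return kept;
    }
}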
We are very interested in your comments and remarks, and it would be
nice to hear from you.
Cheers,
Marc
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g