Hi,

A paragraph from their website:

"Our solution is to run (Java) regular expressions against each webpages prior to extraction, which detect the presence of a microformat in a HTML page, and then only run the Any23 extractor when the regular expression find potentional matches."
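To make the idea concrete, here is a minimal sketch of what such a pre-filter could look like. The class name, the method name and the three patterns are my own assumptions for illustration; they are not the regexes from the Web Data Commons table, and nothing here calls the Any23 API. The intent is only to show a cheap check that returns true when a page might contain structured data, so the expensive extraction can be skipped otherwise.

```java
import java.util.regex.Pattern;

// Hypothetical pre-filter: not part of Any23, names and patterns are assumptions.
public class MicroformatPrefilter {

    // Illustrative hints only; a real filter would need one pattern per supported format.
    private static final Pattern[] HINTS = {
        // a class attribute containing a common microformat root class name
        Pattern.compile("class\\s*=\\s*[\"'][^\"']*\\b(vcard|vevent|hreview)\\b",
                        Pattern.CASE_INSENSITIVE),
        // HTML5 microdata marker
        Pattern.compile("\\bitemscope\\b", Pattern.CASE_INSENSITIVE),
        // RDFa attributes
        Pattern.compile("\\b(typeof|property)\\s*=", Pattern.CASE_INSENSITIVE)
    };

    // Returns true if any hint matches, i.e. the page may be worth a full extraction.
    public static boolean worthParsing(CharSequence html) {
        for (Pattern p : HINTS) {
            if (p.matcher(html).find()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String withHCard = "<div class=\"vcard\"><span class=\"fn\">Jane</span></div>";
        String plain = "<p>plain text</p>";
        System.out.println(worthParsing(withHCard)); // true
        System.out.println(worthParsing(plain));     // false
    }
}
```

A filter like this can give false positives (for example, the word "itemscope" appearing in body text), which is acceptable here because a match only means the full extractor gets run. False negatives are the real risk, so the patterns would have to be kept loose.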
Are we using any techniques like that to decide whether there is anything to parse in a document? Maybe we could build in such a feature, e.g. a method/filter that lets users who want to parse a huge number of documents detect whether a document is worth parsing. They have a table with the regexes they used for each format.

Any opinions about this?

Szymon

On 23 March 2012 10:38, Davide Palmisano <[email protected]> wrote:
> Thanks Michele,
>
> this is great news.
>
> Should we have a section on the web site listing
> all the products/initiatives that are using Any23?
>
> On Fri, Mar 23, 2012 at 11:01 AM, Michele Mostarda
> <[email protected]> wrote:
>> Hi Guys,
>>
>> just a curiosity:
>>
>> Any23 has been recently used to parse the entire corpus of Semantic
>> Web Data existing on the Web [0].
>>
>> The best.
>>
>> Mic
>>
>> [0] http://webdatacommons.org/
>>
>> --
>> Michele Mostarda
>> Senior Software Engineer
>> skype: michele.mostarda
>> twitter: micmos
>> mail: [email protected]
>> site : http://www.michelemostarda.com
>
>
>
> --
> Davide Palmisano
>
> http://davidepalmisano.com
> http://twitter.com/dpalmisano
