Hi Pierre,

I had a similar problem years ago. My first input was a file from the Australian Department of Defence, listing about 77,000 businesses that provided supplies to the Dept. I had to find their unique Australian Business Numbers (ABNs). Some of the businesses already had their ABNs; for the rest I had to repeatedly search a second input, an Australian Federal Government web site, i.e. do a search using the business name. I ended up with 33,000 HTML pages, which were then processed offline.
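For what it's worth, the fetch-and-save-for-offline-processing loop could be sketched roughly like this in Python. To be clear, `lookup_url`, the `name` query parameter, and the use of urllib are my inventions for illustration; the real site and its HTTP details would differ:

```python
import time
import urllib.parse
import urllib.request
from pathlib import Path

def fetch_all(names, lookup_url, out_dir, delay=3.0, fetch=None):
    """Fetch one search-result page per business name and save it
    to out_dir for offline processing.  lookup_url and the 'name'
    parameter are hypothetical stand-ins for the real search site."""
    if fetch is None:
        # Default fetcher; a stub can be injected for testing.
        fetch = lambda url: urllib.request.urlopen(url).read()
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for i, name in enumerate(names):
        url = lookup_url + "?" + urllib.parse.urlencode({"name": name})
        try:
            html = fetch(url)
        except OSError:
            # Site down or local network trouble: skip and move on,
            # so an over-weekend run survives outages.
            continue
        path = out / f"{i:06d}.html"
        path.write_bytes(html)
        saved.append(path)
        time.sleep(delay)  # be polite: a few seconds between requests
    return saved
```

Working up to full load is then just a matter of calling it on `names[:1]`, then `names[:10]`, `names[:100]`, `names[:1000]`, and finally the remainder, eyeballing the saved pages at each step.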
Things I noted:

o Search failures are of various types:
  - Name not found. The format of this result differs from that of a successful search.
  - Network offline. During an over-weekend run the Canberra bushfires affected many things, and the web site was down.
  - Local network issues. I was working at Deakin University in Melbourne at the time (which is in a different state from Canberra).
  - But these issues are minor.

o In case you're wondering about the load on the remote server, of course I worked up to it:
  = Write code to wait for a few seconds between each request. There is no hurry. In my case all hits were on the same server, but you don't have that issue.
  = Run a test on 1 business.
  = Then on 10.
  = Then on 100.
  = Then on 1,000.
  = Then on the remainder.

o Successes displayed something easily found with the human eye, but not by a program:
  - The results were buried in tables 10 (ten) deep. And yes, this pathological structure was of course generated by them using a Microsoft tool (name forgotten/not known).

o Although I have not had any dealings with that Dept, except to be given that MS Access file, I do appreciate living in a country where such a thing can happen.
  - When I went to Singapore to give a paper on this work, there was total silence in the audience when I oh-so-casually mentioned the source of our data, and the generosity of the Dept.
  - The other talks I listened to were very sad. The speakers (and not just from China/Singapore) knew that they could never have such a chance. Their papers discussed /what they might theoretically do if they ever had such data/, because that was as close to live data as they were going to get outside their dreams.

o So, how much of this is relevant to your work?
  - Possibly not much.
  - The result (good/bad) was always structured in tables, and did not change more than once during the whole process.
  - Your problem sounds definitely harder, given the variety of the formats.
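Digging data out of tables nested ten deep is mostly a matter of ignoring the nesting and walking straight to the cells. A minimal sketch with Python's standard html.parser (the markup below is invented, not the actual ABN page):

```python
from html.parser import HTMLParser

class TableCellExtractor(HTMLParser):
    """Collect the text of every <td>, tagged with how deeply its
    table is nested.  Pathological machine-generated markup then
    reduces to a flat list of (depth, text) pairs you can filter."""
    def __init__(self):
        super().__init__()
        self.depth = 0       # current <table> nesting depth
        self.in_cell = False
        self.cells = []      # (depth, text) pairs, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1
        elif tag == "td":
            self.in_cell = True
            self.cells.append((self.depth, ""))

    def handle_endtag(self, tag):
        if tag == "table":
            self.depth -= 1
        elif tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell and data.strip():
            d, text = self.cells[-1]
            self.cells[-1] = (d, (text + " " + data.strip()).strip())

# Invented example: the useful cell sits inside a nested table.
page = ("<table><tr><td><table><tr>"
        "<td>ABN: 12 345 678 901</td>"
        "</tr></table></td></tr></table>")
p = TableCellExtractor()
p.feed(page)
deepest = max(d for d, _ in p.cells)
print([t for d, t in p.cells if d == deepest and t])
# prints ['ABN: 12 345 678 901']
```

Filtering on the deepest (or any fixed) nesting level is one crude but serviceable way to skip the wrapper tables the generator emits.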
  - You talk about being lazy, which of course is easy to say, but actually the best design would be to roll multiple tasks into one clever program.
  - That contradicts the point where you switch to talking about plug-ins. But I suspect the latter would make it easier to cope with the variety of formats.
  - I am surprised by your comments on the data you want being so easy to extract. I bet it turns out to be much more laborious to correctly identify which data is the desired temperature data associated with each given city.

Cheers
Ron

--
You received this message because you are subscribed to the Google Groups "marpa parser" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
For more options, visit https://groups.google.com/d/optout.
