Hi Pierre

I had a similar problem years ago. My first input was a file from the 
Australian Department of Defence, listing about 77,000 businesses that 
supplied goods to the Dept. I had to find their unique Australian 
Business Numbers (ABNs).
Some of the businesses already had their ABNs, so for the rest I had to 
repeatedly search a second input - an Australian Federal Government web 
site - using each business name. I ended up with 33,000 HTML pages, 
which were then processed offline.

Things I noted:

o Search failures are of various types:
- Name not found. The format of this page differs from that of a 
successful search.
- Network offline. During an over-weekend run the Canberra bushfires 
affected many things, and the web site was down.
- Local network issues. I was working at Deakin University in Melbourne at 
the time (which is in a different state from Canberra).
- But these issues are minor.

o In case you're wondering about the load on the remote server, of course I 
worked up to it:
= Write code to wait for a few seconds between each request. There is no 
hurry. In my case all hits are on the same server, but you don't have that 
issue.
= Run a test on 1 business
= Then on 10
= Then on 100
= Then on 1,000
= Then on the remainder
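
The ramp-up above can be sketched like this (Python rather than what I 
actually used; `fetch`, the batch sizes and the delay are stand-ins for 
whatever does one search in your setup):

```python
import time

def run_batches(names, fetch, delay=3.0, batch_sizes=(1, 10, 100, 1000)):
    """Work up to the full job: ever-larger test batches, then the rest.

    `fetch` is hypothetical - whatever performs one search.  `delay` is
    the polite pause between requests; there is no hurry.
    """
    results = {}
    start = 0
    sizes = list(batch_sizes) + [len(names)]  # final "batch" = remainder
    for size in sizes:
        batch = names[start:start + size]
        if not batch:
            break
        for name in batch:
            results[name] = fetch(name)
            time.sleep(delay)  # a few seconds between hits on the server
        start += len(batch)
        # In practice, inspect this batch's output before continuing.
    return results
```

The point of the structure is that each batch is a checkpoint: you look 
at the results before committing to the next, larger run.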

o Successes displayed something easily found by the human eye, but not 
by a program:
- The results were buried in tables 10 (ten) deep. And yes, this 
pathological structure was of course generated by them using a Microsoft 
tool (name forgotten/not known).
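
Digging cell text out of tables nested that deep can be done by tracking 
the nesting depth while parsing. A sketch with Python's stdlib parser - 
the real pages are long gone, so the markup and target depth below are 
made up:

```python
from html.parser import HTMLParser

class TableDepthParser(HTMLParser):
    """Collect <td> text that sits at a given <table> nesting depth."""

    def __init__(self, target_depth):
        super().__init__()
        self.depth = 0          # current <table> nesting level
        self.in_cell = False
        self.target_depth = target_depth
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1
        elif tag == "td" and self.depth == self.target_depth:
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "table":
            self.depth -= 1
        elif tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # Only keep text from cells at exactly the target depth.
        if self.in_cell and self.depth == self.target_depth and data.strip():
            self.cells.append(data.strip())

def cells_at_depth(html, depth):
    p = TableDepthParser(depth)
    p.feed(html)
    return p.cells
```

In my case the interesting data sat at depth 10; once you know the 
depth, the pathological structure becomes an addressing scheme rather 
than an obstacle.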

o Although I have not had any dealings with that Dept, except to be 
given that MS Access file, I do appreciate living in a country where 
such a thing can happen.
- When I went to Singapore to give a paper on this work, there was total 
silence in the audience when I oh-so-casually mentioned the source of our 
data, and the generosity of the Dept.
- The other talks I listened to were very sad. The speakers (and not just 
from China/Singapore) knew that they could never have such a chance. Their 
papers discussed /what they might theoretically do if they ever had such 
data/,
because that was as close to live data as they were going to get 
outside their dreams.

o So, how much of this is relevant to your work:
- Possibly not much
- The result (good/bad) was always structured in tables, and did not change 
more than once during the whole process.
- Your problem sounds definitely harder, given the variety of the formats.
- You talk about being lazy, which of course is easy to say, but actually 
the best design would be to roll multiple tasks into one clever program.
- And that contradicts where you switch to talking about plug-ins. But I 
suspect the latter would make it easier to cope with the variety of formats.
- I am surprised re your comments on the data you want being so easy to 
extract. I bet it turns out to be much more laborious to correctly 
identify which data is the desired temperature data associated with 
each given city.

Cheers
Ron


-- 
You received this message because you are subscribed to the Google Groups 
"marpa parser" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/d/optout.
