#363: Number of results shown in XML-ish formats
------------------------+---------------------------------------------------
Reporter: tbrooks | Owner:
Type: defect | Status: new
Priority: major | Milestone:
Component: BibFormat | Version:
Resolution: | Keywords:
------------------------+---------------------------------------------------
Comment (by arwagner):
The point that such large lists are ''necessary'' and ''need to be
served'' is crystal clear, and some way of serving them needs to be
found. IMHO this can actually be accomplished quite easily by formats like
BibTeX or EndNote Tagged (RIS). Just dumping it out in XML-style formats
might not give the user what they expect here: if your XML parser takes
hours and GBs of RAM to process such large structures, it is no fun at
all. Therefore it might be a good idea to guide users to a better-suited
format for such large lists.
Some observations, as I have similar use cases in a bibliometric context.
This deals not only with the publication lists of well-known researchers
but with whole institutions (or even countries). Therefore, I sometimes
need to process well beyond 10,000 records at a time. Usually, I get them
via web service interfaces from some database (''fast'' would be another
story, but that's life...). The observations here are pretty simple.
First, XML is way too chatty, resulting in large amounts of data that need
to be transferred. From the end user's point of view this can be neglected
in the CV use case on current hardware. But imagine the publication list
of SLAC; there the story already gets interesting. From the server's point
of view, this might be an issue. The next observation is that if you want
to process a decent number of records containing decent bibliographic data
(and maybe even links to the references) in XML, you should definitely
limit the chunk size. On a usual desktop PC, ~100 records are, in my
experience, OK. You can go a bit higher, but the fun decreases
dramatically with the number of records processed at once, and the
decrease is not linear. Depending on who processes the records, you'll
also have to assume a usual desktop computer, not a workstation. Throwing
enough computing power at it may lift the upper limit a bit.
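To illustrate the chunking point, here is a minimal sketch of fetching and
parsing a large result set ~100 records at a time instead of as one huge
XML document. It assumes the usual Invenio search URL parameters (p, of,
rg, jrec) and a made-up base URL; adjust both for the actual installation.
{{{
#!python
# Sketch: fetch a large result set in chunks of ~100 MARCXML records,
# pausing between requests, instead of one giant XML dump.
import time
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

BASE = "http://invenio-software.org/search"  # hypothetical instance URL
CHUNK = 100                                  # ~100 records parse comfortably

def fetch_chunk(pattern, offset):
    # Request one chunk of MARCXML (of=xm): CHUNK records starting at `offset`.
    query = urllib.parse.urlencode(
        {"p": pattern, "of": "xm", "rg": CHUNK, "jrec": offset})
    with urllib.request.urlopen(BASE + "?" + query) as resp:
        return ET.fromstring(resp.read())    # a small tree is cheap to parse

def fetch_all(pattern, total):
    # Walk the whole result set chunk by chunk, with a polite pause in between.
    for offset in range(1, total + 1, CHUNK):
        yield fetch_chunk(pattern, offset)
        time.sleep(2)
}}}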
A look at "the competitors" (be they orange, green or blue) might be in
order here. For their web services (this is a different usage, I'm well
aware of that, but the XML issue is the same) you usually have limits of
100 records per query. If you want more, you have to implement loading
several chunks on the client side, and you have to add decent pauses
between your requests, otherwise you'll trigger the robot detection system
that prevents overloading the database. Similarly, in OAI-PMH you only get
a limited number of records per chunk. However, if you want to download
bibliographic data from their web interface, you can get at least 500
records in EndNote Tagged without any issues. So the chunk size for "not
XML" is 5 times as large. One could imagine this has some meaning.
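For comparison, the OAI-PMH case mentioned above already forces this
pattern on the client: the server returns a limited chunk plus a
resumptionToken, and the harvester loops (with pauses) until the token
runs out. A minimal sketch, with a hypothetical endpoint URL:
{{{
#!python
# Sketch of an OAI-PMH ListRecords harvest: limited chunks, resumptionToken,
# and a pause between requests.
import time
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OAI = "http://invenio-software.org/oai2d"   # hypothetical endpoint
NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

def harvest(metadata_prefix="marcxml"):
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        url = OAI + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as resp:
            tree = ET.fromstring(resp.read())
        for rec in tree.findall(".//oai:record", NS):
            yield rec
        token = tree.find(".//oai:resumptionToken", NS)
        if token is None or not (token.text or "").strip():
            break                           # last chunk reached
        # Follow-up requests carry only the verb and the resumptionToken.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
        time.sleep(2)                       # pause between chunks
}}}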
--
Ticket URL: <http://invenio-software.org/ticket/363#comment:6>
Invenio <http://invenio-software.org>