#363: Number of results show in XML-ish formats
------------------------+---------------------------------------------------
  Reporter:  tbrooks    |       Owner:     
      Type:  defect     |      Status:  new
  Priority:  major      |   Milestone:     
 Component:  BibFormat  |     Version:     
Resolution:             |    Keywords:     
------------------------+---------------------------------------------------

Comment (by arwagner):

 The point that such large lists are ''necessary'' and ''need to be
 served'' is crystal clear, and some way to serve them has to be found.
 IMHO this can actually be accomplished easily by formats like BibTeX or
 EndNote Tagged (RIS). Simply dumping everything out in XML-style formats
 might not give the user what they expect here: if your XML parser takes
 hours and GBs of RAM to process such large structures, it is no fun at
 all. Therefore it might be a good idea to guide users to a format better
 suited for such large lists.

 Some observations, as I have similar use cases in a bibliometric context.
 This deals not only with the publication lists of well-known researchers
 but with whole institutions (or even countries). Therefore, I sometimes
 need to process more than 10,000 records at a time. Usually I get them
 through some web service interface of some database (''fast'' would be
 another story, but that's life...). The observations here are pretty
 simple. First, XML is way too chatty, resulting in large amounts of data
 that need to be transferred. From the end user's point of view this can
 be neglected in the CV use case on current hardware. But imagine the
 publication list of SLAC; there the story can already get interesting.
 From the server's point of view, this might be an issue. The next
 observation: if you want to process a decent number of records containing
 decent bibliographic data (and maybe even links to the references) in
 XML, you should definitely limit the chunk size. On a usual desktop PC,
 ~100 records are, in my experience, OK. You can go a bit higher, but fun
 decreases dramatically with the number of records processed at once, and
 the decrease in fun is not linear. Depending on who processes the
 records, you also have to assume a usual desktop computer, not a
 workstation. Throwing enough computing power at it may lift the upper
 limit a bit.

 A look at "the competitors" (be it orange, green or blue) might be in
 order here. For their web services (this is another usage, I'm well aware
 of that, but the xml-issue is the same) you've usually limits to 100
 records per query. In case you want more you've to implement loading
 several chunks on the client side and you have to add decent pauses
 between your requests otherwise you'll trigger the robot detection system
 preventing overloading the database. Similarly, in OAI-PMH you get only a
 limited number of records per chunk. However, if you want to download
 bibliographic data from their web interface you can get 500 records in
 EndNote Tagged at least and without any issues. So the chunk for "not XML"
 is 5 times as large. One could imagine this has some meaning.
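
 For completeness, a minimal sketch of what such client-side chunking with
 pauses could look like against an Invenio-style search URL. The base URL,
 the query parameters (p, jrec, rg, of), the chunk size and the pause
 length are illustrative assumptions, not a confirmed interface:

{{{
#!python
# Minimal sketch: fetch a long result list in chunks, pausing between
# requests. BASE_URL and the query parameters are assumptions for
# illustration only.
import time
import urllib.parse
import urllib.request

BASE_URL = "http://example.org/search"  # hypothetical search endpoint
CHUNK_SIZE = 100                        # stay below the per-query limit
PAUSE_SECONDS = 5                       # be polite, avoid robot detection

def fetch_all(query, total):
    """Download `total` records for `query` in chunks of CHUNK_SIZE."""
    chunks = []
    for start in range(1, total + 1, CHUNK_SIZE):
        params = urllib.parse.urlencode({
            "p": query,        # search pattern
            "jrec": start,     # first record of this chunk
            "rg": CHUNK_SIZE,  # records per chunk
            "of": "xm",        # output format (e.g. MARCXML)
        })
        with urllib.request.urlopen(BASE_URL + "?" + params) as resp:
            chunks.append(resp.read())
        time.sleep(PAUSE_SECONDS)  # pause before the next chunk
    return chunks
}}}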

-- 
Ticket URL: <http://invenio-software.org/ticket/363#comment:6>
Invenio <http://invenio-software.org>