RE: [inspire-dev] Several questions about search engine of Invenio and possible extensions

Piotr Praczyk Wed, 29 Jun 2011 18:02:00 +0200

Hi


>On Wed, 29/06/2011 at 12.47 +0000, Piotr Praczyk wrote:
>> Some time ago Tibor mentioned the possibility of enabling the BibIndex
>> to take into account some fields from MoreInfo entries in BibDoc.
>> There was also a discussion about the possibility of splitting the namespace
>> of records into several independent chunks (for example having
>> records from 10^8 - 10^12 be assigned to a different class of records).
>> These two features would be very useful for the use case of data
>> preservation in Inspire (and also figures infrastructure would benefit
>> from them).
>Yup! Another nice discussion for project-cdsware-developers :-)
So I CC the group


>IMHO, there is a philosophical issue here. In Invenio, up to now, the
>unit of information has always been the record. Everything is built and
>optimized around it. 

I am completely aware of this, but for now we have decided that we do not 
want to upload figures as records,
and that we have to provide enough infrastructure for non-MARC entities.

>BibIndex creates indexes where
>words/pairs/phrases point to an intbitset containing record ids.
>An intbitset, although extremely fast, when loaded in memory will take up
>as many bytes as the biggest ID stored, divided by 8. So if you plan to
>store IDs such as 10^8 this would lead to enormous intbitsets, actually
>using at least 12.5MB each, even in the case there is only 1 ID.

10^8 is a completely reasonable number (for example when talking about figures).
Nonetheless, I think the situation is not that bad. If namespaces were split, 
BibIndex could create intbitsets for separate "namespaces" (ranges 
of IDs).
In this case the biggest intbitset would have the size (in bytes) of the biggest 
cluster / 8.
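
For illustration, a minimal sketch of the arithmetic above, assuming a dense bitset that spends one bit per possible ID up to the largest one stored (as described in the quoted text). The helper names and the namespace start are hypothetical and not part of Invenio or intbitset:

```python
def dense_bitset_bytes(max_id):
    """Bytes used by a dense bitset whose largest stored ID is max_id."""
    return max_id // 8 + 1


def local_id(global_id, namespace_start):
    """ID relative to the start of its namespace (hypothetical scheme)."""
    return global_id - namespace_start


FIGURES_START = 10**8  # assumed first ID of a "figures" namespace

# One global bitset holding only the single ID 10^8.
global_cost = dense_bitset_bytes(FIGURES_START)

# The same ID stored in a per-namespace bitset, relative to the namespace start.
namespaced_cost = dense_bitset_bytes(local_id(FIGURES_START, FIGURES_START))

# Upper bound per bitset for a namespace that holds 10^7 figures.
full_namespace_cost = dense_bitset_bytes(
    local_id(FIGURES_START + 10**7 - 1, FIGURES_START)
)
```

Under these assumptions the single global bitset takes about 12.5 MB, while a per-namespace bitset is bounded by the size of its namespace divided by 8.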

>> We need to be able to store custom objects that are not attached to
>> publications. (Another discussion about technicalities of extending
>> BibDocFile infrastructure). These objects would be perfect for storing
>> data files that should be preserved.
>> The objects should be searchable and have their splash pages (that
>> could be registered in the organisation that manages DOIs)
>If they are searchable, doesn't it mean they have metadata? Isn't it in
>that case worth having a record representing them?

Indeed they have meta-data, but first, we do not want to show all the meta-data 
(this could be solved by hiding fields).
Second, the meta-data is not natively MARC, so we would have to provide 
transformations both ways between, for example, DataCite and MARC.
Moreover we would have to standardise MARC. (I am scared of even thinking of 
this, as most of the datafields we need are not foreseen in the standard.)

Third, MoreInfo is effectively storing meta-data, and we were talking with
Tibor about enabling the indexing of some of its fields in the future. The 
situation is not much different here.

>> As far as I know, BibIndex creates simple maps word-> set of record
>> numbers and record number-> set of words.
>> Do you think it would be possible not only to split namespaces but
>> also to allow some numbers to point to non-record entities?
>> We could for instance have record numbers 10^12 till 10^123 pointing
>> to custom objects. There would be a separate splash page for these and
>> search engine could be easily integrated.
>> While showing the search results we could simply look at the number
>> and if it belongs to a certain range, link to /object/12345 rather
>> than /record/12345.
>>
>> Another solution (I imagine, much more difficult), would be to change
>> BibIndex to create maps pointing to a bit more structured identifiers
>> -> r1234 would be record 1234 and o1234 would be object 1234.
>In both cases, the only solution then would be to abandon intbitset
>which would imply a loss of 10 to 100 times in performance.

I think this resembles Roman's integration of Lucene? 
Merging results from intbitsets and external sources sounds familiar.
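
To make the two identifier schemes from the quoted proposal concrete, here is a small sketch. The boundary 10^12, the function names and the URL patterns are assumptions taken from the examples in this thread, not existing WebSearch code:

```python
OBJECT_RANGE_START = 10**12  # assumed lower bound of the object ID range


def result_url(entity_id):
    """Choose the splash page by looking only at the numeric range of the ID."""
    if entity_id >= OBJECT_RANGE_START:
        return "/object/%d" % entity_id
    return "/record/%d" % entity_id


PREFIXES = {"r": "record", "o": "object"}  # hypothetical prefix table


def parse_structured_id(identifier):
    """Split a structured identifier such as 'r1234' into (kind, number)."""
    kind = PREFIXES[identifier[0]]
    return kind, int(identifier[1:])
```

The first function keeps plain integer IDs and only inspects their range; the second relies on identifiers that are no longer plain integers.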

>> What do you think about this ? This would allow much clearer treatment
>> of custom objects as they would not have to contain two different
>> entities in Invenio - record and BibDocument (or later BibObject).
>>
>> From the code point of view, we could have a splash page for a
>> BibDocument (BibObject). These splash pages could be aggregated in
>> record display tabs in the case of
>> having objects attached to regular bibliographical records.
>>
>>
>> The second considered solution involves having MARC records for every
>> BibObject, but this seems to be a bit cumbersome for several reasons
>>      * The number of Invenio entities describing the same object would
>>        grow

>What exactly do you mean here?

BibObject (or BibDoc in the current nomenclature) + a MARC record linking only to 
this object. Moreover, meta-data would be split between MoreInfo and MARC, which 
might not be very clear.

>>       * We would have to invent MARC description for every meta-data
>>         field of the DataCite format
>You can always store it as a blob in one subfield and just expand the
>most common MARC info.
What do you mean by MARC info? Do you mean serialising non-MARC meta-data and 
storing it in MARC?
If so, it sounds a bit dirty.


>>       * We would create many records
>Yes, this is true, but it wouldn't be so dramatic, as Invenio should
>be able to scale up to 10.000.000 records (assuming 10 plots per paper
>this is suitable for INSPIRE)

10.000.000 = 10^7 (only for plots in this case);
besides this there would be 10^N uploaded objects.
Would we have a lot of spare capacity for future data ingestion? 

>>       * If we decide to create MARC records for Objects, for the sake
>>         of consistency we should create records for figures.
>You can decide indeed which objects are suitable to have a standalone
>record.

All objects having a DOI ... and in the future, all objects? 

> P.s. have you seen the ticket (still on Savannah)?
> <https://savannah.cern.ch/task/?7572>
>it is about the possibility to extend BibIndex with pluginutils, and
>make all the current hacks official, in order to be able to index any
>information related to a record (e.g. comments, derived values from
>MARC, MoreInfo values of objects attached to records). This of course
>assuming BibIndex and WebSearch still consider only one type of unit of
>information, i.e. the record

I had not seen it, thanks. Funny that we have had similar considerations in 
connection with BibEdit.
So the ticket would have to change so that the function accepts not only a MARC 
record but rather the record ID, and has access to the whole data storage.
What would be the metadata field to which the calculated value would be 
attached? (It does not have a precise position in the MARC record.)


>P.p.s. if you still are convinced about the need to separate objects and
>records, since you would any way need to modify the whole Invenio, a
>possible solution, that would not imply losing too much in performance,
>would be to store objects in separate indexes for objects (that would be
>consulted explicitly by WebSearch), so that you can keep on profiting
>from the speed of intbitset while using the whole low range of integers
>(starting from 1).

Yes... I can see that, though with the initial proposal I wanted to reduce the 
amount of work needed by reusing existing infrastructure.
The need for separation arose when we decided not to upload figures as MARC 
records.
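
A rough sketch of the separate-indexes idea from the quoted P.p.s., with plain Python sets standing in for intbitset and made-up index contents. `INDEXES`, `search` and `search_all` are hypothetical names, not the WebSearch API:

```python
# One word index per entity kind; each kind numbers its entities from 1.
INDEXES = {
    "record": {"higgs": {1, 2, 5}, "boson": {2, 5}},
    "object": {"higgs": {1, 3}},
}


def search(kind, words):
    """Return the IDs of one entity kind that match all the given words."""
    index = INDEXES[kind]
    hits = None
    for word in words:
        ids = index.get(word, set())
        hits = set(ids) if hits is None else hits & ids
    return hits if hits is not None else set()


def search_all(words):
    """Consult every index explicitly and tag each hit with its kind."""
    return [
        (kind, entity_id)
        for kind in sorted(INDEXES)
        for entity_id in sorted(search(kind, words))
    ]
```

Each entity kind keeps its own low range of IDs, and the kind is attached only when the results are merged for display.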

Piotr
