In our case we store in the field 995__a the timestamp describing the
status of the record in the old ADS and that we use to check if the
record needs to be updated in Invenio.
This timestamp is actually a JSON structure describing the status of
all the files that in the old ADS generate a record.
At the moment we don't populate the bib99x with references or other
fields, so we only have timestamps.

If I run the query you described I have:
mysql> SELECT COUNT(DISTINCT(SUBSTR(value,1,35)))/COUNT(DISTINCT(value))
FROM bib99x;
+------------------------------------------------------------+
| COUNT(DISTINCT(SUBSTR(value,1,35)))/COUNT(DISTINCT(value)) |
+------------------------------------------------------------+
|                                                     0.0305 |
+------------------------------------------------------------+
1 row in set (2 min 23.93 sec)

So we have only the 3% of record covered with 35 characters.

If I run
mysql> SELECT COUNT(DISTINCT(SUBSTR(value,1,100)))/COUNT(DISTINCT(value))
FROM bib99x;
+-------------------------------------------------------------+
| COUNT(DISTINCT(SUBSTR(value,1,100)))/COUNT(DISTINCT(value)) |
+-------------------------------------------------------------+
|                                                      0.8200 |
+-------------------------------------------------------------+
1 row in set (3 min 7.38 sec)

We have 82% records with 100 characters.

With 150 characters nothing really changes:

mysql> SELECT COUNT(DISTINCT(SUBSTR(value,1,150)))/COUNT(DISTINCT(value))
FROM bib99x;
+-------------------------------------------------------------+
| COUNT(DISTINCT(SUBSTR(value,1,150)))/COUNT(DISTINCT(value)) |
+-------------------------------------------------------------+
|                                                      0.8262 |
+-------------------------------------------------------------+
1 row in set (3 min 8.76 sec)

While with 200 characters we have 98% of the records

mysql> SELECT COUNT(DISTINCT(SUBSTR(value,1,200)))/COUNT(DISTINCT(value))
FROM bib99x;
+-------------------------------------------------------------+
| COUNT(DISTINCT(SUBSTR(value,1,200)))/COUNT(DISTINCT(value)) |
+-------------------------------------------------------------+
|                                                      0.9819 |
+-------------------------------------------------------------+
1 row in set (3 min 25.05 sec)

And with 250 characters we have almost all the records:

mysql> SELECT COUNT(DISTINCT(SUBSTR(value,1,250)))/COUNT(DISTINCT(value))
FROM bib99x;
+-------------------------------------------------------------+
| COUNT(DISTINCT(SUBSTR(value,1,250)))/COUNT(DISTINCT(value)) |
+-------------------------------------------------------------+
|                                                      0.9945 |
+-------------------------------------------------------------+
1 row in set (3 min 29.03 sec)

So I think that in our case 200 characters is the minimum value for this field.

Giovanni


On Mon, Mar 19, 2012 at 8:40 AM, Tibor Simko <[email protected]> wrote:
> On Mon, 19 Mar 2012, Tibor Simko wrote:
>> BTW, you seem to have a limit of 200 for bib99x now.  I think your URLs
>> may all fit well into 100, isn't it?
>
> Here is how you can check what is the percentage of bib99x values that
> would be fitting into 35; an example for INSPIRE:
>
>   mysql> SELECT COUNT(DISTINCT(SUBSTR(value,1,35)))/COUNT(DISTINCT(value)) 
> FROM bib99x;
>   +------------------------------------------------------------+
>   | COUNT(DISTINCT(SUBSTR(value,1,35)))/COUNT(DISTINCT(value)) |
>   +------------------------------------------------------------+
>   |                                                     0.9892 |
>   +------------------------------------------------------------+
>   1 row in set (36.51 sec)
>
> So indexing only 35 leading characters for INSPIRE site conditions
> nicely covers 98.9% of our reference values.
>
> Best regards
> --
> Tibor Simko

Reply via email to