Re: [Scid-users] Database for Scid: CentriScid

Alexander Wagner Wed, 25 Jun 2008 11:58:23 -0700

Pascal Georges wrote:

Hi!


Let me join in here and take up a few things. First, let me
remark that there is still a fundamental misunderstanding
of my suggestions. I hope I can clear that up.

> To get a base for Scid that can be used for training (for example) and 
> to keep track of extra info, I think there is a workaround by mixing 
> indexed flags and PGN tags.

Hey! You got it :)

> Each game can get one or several flags that are :
> IDX_FLAG_START         // Game has own start position.
[...]
> IDX_FLAG_USER          // User-defined flag.

The question is whether the array of flags could be extended
a bit, so that there could be more flags without breaking
any compatibility. (E.g. personally I'd like to have more
user flags.)

> So imagine you have a big (or small ... it also works) DB and want to 
> keep track of tactics. So each relevant game gets the flag
>     IDX_FLAG_TACTICS
> and for example the PGN tag is appended :
>     FLAG_TACTICS_data "Removal of the guard/23/black/easy"
> 
> that is in order : type/move/side/difficulty

You got it again :) Except for one single point, which I'll
come to: I would not need the FLAG_TACTICS_data.

> Searches in base are fast because the prefilter of indexed flags.

Yes! :)
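
A minimal sketch of that two-step search, in Python for
illustration only (it is not Scid code): the flag is one bit in
an integer mask, and the extra tag is parsed only for games that
pass the cheap flag test. The bit value, the dict layout and the
helper names are assumptions; only the tag name and the
type/move/side/difficulty order come from Pascal's mail.

```python
from __future__ import annotations

from typing import Iterator, NamedTuple

# Hypothetical bit value; the real flag bits live in Scid's index code.
IDX_FLAG_TACTICS = 1 << 0


class TacticsData(NamedTuple):
    kind: str
    move: int
    side: str
    difficulty: str


def parse_tactics_tag(value: str) -> TacticsData:
    """Split 'type/move/side/difficulty' into its four fields."""
    # Splitting from the right keeps any '/' inside the type text intact.
    kind, move, side, difficulty = value.rsplit("/", 3)
    return TacticsData(kind, int(move), side, difficulty)


def find_tactics(games: list[dict], side: str) -> Iterator[tuple[dict, TacticsData]]:
    """Yield flagged games whose tactics tag names the given side."""
    for game in games:
        # Cheap prefilter: one bit test, no tag parsing for unflagged games.
        if not game["flags"] & IDX_FLAG_TACTICS:
            continue
        value = game["tags"].get("FLAG_TACTICS_data")
        if value is None:
            continue
        data = parse_tactics_tag(value)
        if data.side == side:
            yield game, data


games = [
    {"flags": IDX_FLAG_TACTICS,
     "tags": {"FLAG_TACTICS_data": "Removal of the guard/23/black/easy"}},
    {"flags": 0, "tags": {}},
]
hits = list(find_tactics(games, "black"))
```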

> The data added to "tactics" should be defined once for all and part of 
> Scid's UI.

I think a file of tags that is read in would be more
suitable for this kind of data. I'll explain in a minute
why.

> The most interesting flags are (I think)
>     Middlegame
>     Endgame
>     Tactics

Depends. I also use a lot: white opening, black opening,
brilliancy, blunder and actually user.

> So for each category could someone list the necessary fields like, for 
> the tactics example :
> 
> Category Tactics :
>     type : pin, overburden, ... , undefined
>     move : the move number
>     side : white or black
>     difficulty : very easy, easy, medium, difficult, very difficult, 
> undefined
>     solved : solved or unsolved
>     comment : free text

Ok, now my suggestions, actually the other way round, as your
mail is ordered that way. Let's start with the "keyword"-like
stuff.

First of all: I did _NEVER_ suggest reinventing some system
like DDC. It was just meant as an example, as especially
those colleagues from the Americas get it with their mother's
milk.

Generally, my ideas aim to _MINIMISE_ the work for the
contributors, not to maximise it. IMHO the contributors
should have the maximum amount of time for collecting and
indexing, not for formal stuff. (Some of the ideas below
actually come from experience in my job.)

1. Categories

    I think it is most suitable to define the necessary
    categories while actually building the DB. One could set
    up a handful and add new ones as necessary. You surely
    will not know all the necessary categories in advance,
    nor whether those you invent will actually be used.
    (This is direct experience from my job with our rule
    sets, and those are really complex rules.)

    Why: this avoids a lot of theoretical "think about" and
    the invention of groups that are actually not needed.

    How: if a category is defined write it down. It is
    essential that all people in the group use the same words
    for the categories.

    Build it into scid: well, this is the workaround of using
    UIDs for categories and keywords. Not a flexible way but
    a doable way. I'd suggest reading them in from a file,
    though. That way every contributor could add a new term
    if (s)he feels it necessary, without touching program code.
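
    Reading the terms from a file could look roughly like the
    following Python sketch. The one-term-per-line layout, the
    '#' comment convention and the file name are assumptions
    made for illustration, nothing of this exists in scid.

```python
from __future__ import annotations

import io
from typing import Iterable


def load_terms(lines: Iterable[str]) -> list[str]:
    """Return the terms from an iterable of lines, one term per line."""
    terms = []
    seen = set()
    for raw in lines:
        term = raw.strip()
        # Skip blank lines and '#' comments (a convention assumed here).
        if not term or term.startswith("#"):
            continue
        key = term.casefold()
        if key not in seen:  # ignore repeated entries, whatever their case
            seen.add(key)
            terms.append(term)
    return terms


# Stand-in for a shared text file such as a hypothetical categories.txt.
sample = io.StringIO("# categories\nTactics\nEndgame\n\nMiddlegame\nendgame\n")
categories = load_terms(sample)
```

    A contributor who needs a new category just appends a line
    to the shared file; the same loader would serve for the
    keyword list.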

2. Keywords

    Pascal names some which are surely suitable. I'd suggest
    putting them in a list that is also extended while
    building the DB. You most likely will not know all the
    necessary terms in advance.

    Why and how: see 1. Categories

3. Flags

    It's IMHO essential to use them to allow for fast
    searching. I'd suggest setting flags while building up
    the reference DB I thought aloud about.

    Why: Flagging correctly, together with proper keywording,
    allows one to draw exactly those games from the big DB
    that are needed to create a small, specialised training
    collection from it.

    This saves work as it is done only once, i.e. those
    working on the RefDB also create the training DBs on
    the fly and vice versa. (I.e. someone who sets up a
    training DB could add the games to stage 3, see below.
    They would not have to be complete yet!)

4. UID

    IMHO this is essential. That there are no UIDs is IMHO a
    major current drawback in scid. I have been missing this
    feature for a while.

    Why: it allows unique referencing of a game within a DB.
    The game number is not a good ID, as it changes if the
    base is re-sorted, appended to, compressed and so on.

    A unique ID is the _only_ reliable way for a computer to
    get a unique answer to a query. Michal's argument "if I
    know that game I know the players" only holds as long as
    Michal is the one looking at it. It no longer holds once
    I want to make automatic queries, e.g. to extract a
    certain part of the DB. (See next point.)

    How: it should be unique, allow the identification of the
    associated base, and allow for versioning, as a game may
    change in the history of the DB. All this should be
    reflected by the UID.

    I suggest (again):

    <basename>:<gameversion>-<number>

    This ID should be placed in a normal PGN header field. I
    suggest using CmailGameName, as this is already used in
    the CC code to make up for the lack of UIDs. (The name has
    to do with compatibility with cmail, used in email chess.)

    The idea of this format is that even a computer program
    could extract from the UID:

       - which database to load
       - which game number to search for
       - whether the game's version is correct
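
    To show how little is needed for that, here is a small
    Python sketch that takes such a UID apart and puts it
    together again. The class and function names are invented
    for this example, the two-digit version width is an
    assumption, and the sample number is made up.

```python
import re
from typing import NamedTuple


class GameUID(NamedTuple):
    base: str      # which database to load
    version: int   # version of the game
    number: int    # which game number to search for

    def __str__(self) -> str:
        # Two-digit version is an assumption about the written form.
        return f"{self.base}:{self.version:02d}-{self.number}"


# <basename>:<gameversion>-<number>
_UID_RE = re.compile(r"(?P<base>[^:\s]+):(?P<version>\d+)-(?P<number>\d+)")


def parse_uid(text: str) -> GameUID:
    """Parse a UID string, raising ValueError if it does not fit the format."""
    m = _UID_RE.fullmatch(text.strip())
    if m is None:
        raise ValueError(f"not a valid UID: {text!r}")
    return GameUID(m["base"], int(m["version"]), int(m["number"]))


uid = parse_uid("CentriScid:17-104711")
```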

    This allows for things to be done automatically later on
    if the need arises. You can say: this need does not arise.
    I tell you from experience: it does, and you'll damn the
    day you decided not to add this single line of metadata.
    (I already do from time to time. ;)

    For <number> I suggest that each contributor to the base
    gets a block of numbers to use up and is then assigned a
    new block. The numbers would not be consecutive, but this
    simplifies the procedure and avoids the same number being
    given out twice.
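
    The bookkeeping for such blocks is tiny. A sketch in
    Python, where the class name, the block size and the
    contributor names are all made up for illustration:

```python
class BlockAllocator:
    """Hands out disjoint blocks of game numbers to contributors."""

    def __init__(self, block_size: int = 1000, start: int = 1) -> None:
        self.block_size = block_size
        self.next_free = start
        self.assigned = []  # log of (contributor, block) pairs

    def assign(self, contributor: str) -> range:
        """Reserve the next unused block for one contributor."""
        block = range(self.next_free, self.next_free + self.block_size)
        self.next_free = block.stop  # blocks never overlap, numbers never repeat
        self.assigned.append((contributor, block))
        return block


alloc = BlockAllocator(block_size=3)
anna = alloc.assign("anna")     # hypothetical contributors
bert = alloc.assign("bert")
anna2 = alloc.assign("anna")    # anna has used up her first block
```

    Only the single counter has to be kept in one central
    place; within a block every contributor can work offline.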

    I strongly suggest that UIDs of games that get deleted
    are _NOT_ reused, ever. (There's an infinite amount of
    numbers, no need to be stingy with them here.)

    This does _not_ necessarily mean that the ref-db contains
    all versions that ever existed (though I think this has
    some charming side effects). BUT it allows a diff between
    versions of the DB to be distributed easily. Distributing
    a diff is not a matter of bandwidth, it is a matter of
    convenience for the user: if I add my own additions to
    the ref db (e.g. analysis I did) I do not want to throw
    them away because a new db comes out. I want to smoothly
    add the new contents.
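
    One way such a smooth update could work, sketched in
    Python: games are matched by their permanent <number>, a
    newer version replaces the local copy, and the user's own
    additions are carried over. The dict layout and the
    "my_notes" field are assumptions for this example only.

```python
from __future__ import annotations


def split_uid(uid: str) -> tuple[int, int]:
    """Return (version, number) from '<basename>:<gameversion>-<number>'."""
    _, rest = uid.split(":", 1)
    version, number = rest.split("-", 1)
    return int(version), int(number)


def merge_release(local: dict[int, dict], release: list[dict]) -> list[int]:
    """Update `local` in place from `release`; return the numbers that changed."""
    changed = []
    for game in release:
        version, number = split_uid(game["uid"])
        mine = local.get(number)
        if mine is not None and split_uid(mine["uid"])[0] >= version:
            continue  # local copy is already at this version or newer
        updated = dict(game)
        if mine is not None and "my_notes" in mine:
            updated["my_notes"] = mine["my_notes"]  # keep the user's additions
        local[number] = updated
        changed.append(number)
    return changed


local = {
    1: {"uid": "CentriScid:12-1", "moves": "1. e4 e5", "my_notes": "check move 23"},
    2: {"uid": "CentriScid:12-2", "moves": "1. d4 d5"},
}
release = [
    {"uid": "CentriScid:17-1", "moves": "1. e4 e6"},
    {"uid": "CentriScid:12-2", "moves": "1. d4 d5"},
    {"uid": "CentriScid:17-3", "moves": "1. c4"},
]
changed = merge_release(local, release)
```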

    I suggest that a database index is created for the new
    UID field, to allow for searches in this particular field
    which are as fast as the common searches, e.g. for
    players.

5. Training DB and RefDB (i.e. CentriScid)

    Whenever I talked about CentriScid I talked about a,
    hopefully large, high quality reference db. This is
    important. I never referred to small specialised training
    DBs. CentriScid for me is meant to get large.

    I suggest to create this large reference DB with a strong
    focus on quality concerning header tags (what I call
    metadata), completeness with regards to tournaments and
    events, move orders as far as this is possible. I'd place
    "as big as possible" not as the primary target, but "as
    reliable as possible".

    I suggest that this DB is set up by volunteers here from
    the community who want to contribute to the scid project
    as a whole, which is not only about a piece of software. I
    feel that we have many pretty good players around who
    would do better to contribute their chess knowledge to
    building up such a DB than to sit down and learn Tcl ;) (My
    chess unfortunately does not get better as I get better in
    Tcl. Well, Pascal may have his doubts whether even the
    latter happens ;)

    I suggest that the minimum header field info required is
    laid down somewhere in written form for easy reference.
    And I suggest that any additional information that is
    available is not thrown away but kept.

    I suggest that especially those of you join in who skim
    through games regularly anyway and are building up
    something like such a DB on their own. To those I suggest
    sharing their work in this community effort.

    I suggest to build this base in 3 stages:

    - first stage: A new event comes in, the PGNs are added,
      but not yet checked at all. At this stage each game is
      already assigned a UID of the form

      CentriScid:00-<number>

      This stage also allows for the addition of empty games
      as well as unfinished games. Typically, TWIC would end
      up each week in this stage.

    - second stage: the event (tournament or whatever) is
      checked for completeness and formal things get checked,
      i.e. the spelling of the event and the consistency of
      the naming (is it ol, olympiad, olym. or what?). All
      checked games get a promoted UID

      CentriScid:05-<number>

      where <number> is _not_ changed. It is not necessary to
      keep the old, uncorrected version. Whether to do this
      or not is up to the community to decide.

      Duplicate games also get removed here. This is of
      special importance!

      At this point games have to be finished and tournaments
      complete. They'd stay in stage 1 as long as they are
      incomplete. I.e. scid can produce cross tables at
      stage 2, stuff like that.

    - third stage: someone has gone over the events in the
      second stage, assigned flags and keywords, and probably
      fixed errors in move order or whatever, if they can be
      found. (See also below on "indexing".) These games get
      their final UID

      CentriScid:<release>-<number>

      <number> is still the same, <release> is counted up
      for each release. Most likely games at this stage will
      never change. But if an error is found later on, and a
      game moved to stage 3 at release 12 is to be corrected
      while the next release is 17, it gets a promoted UID

      CentriScid:17-<number>

      <number> still stays the same. But one can now see that
      this game was touched, and someone who is at release 14
      and wants to upgrade needs to get all games that are
      labelled CentriScid:15-* through CentriScid:17-*. A
      suitable diff is easily created from the games' UIDs.
      (Note that there is a large company in Hamburg that can
      not do this. I asked explicitly. They were polite enough
      to tell me that they can not accomplish this and are
      missing this feature themselves.)

      Still, it's up to the community to decide whether both
      the V:12 and the V:17 version of the game in question
      are kept. As it would be enough to store the V:12
      database somewhere, there is no need to generate
      duplicates within the same DB. I never suggested
      creating double entries; in particular, I explicitly
      suggest removing them for the release version. But I do
      suggest keeping track of whether a game gets changed
      later on.

      A release is made from time to time by freezing the DB,
      making a clear cut, and making it freely available for
      general download under a suitable number. A criterion
      could be a certain amount of changes compared to the
      last release.
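
    The promotion through the stages and the selection of
    games for an upgrade can be sketched in a few lines of
    Python. The function names are invented for this example,
    and it assumes that release numbers start above the stage
    values 00 and 05 so that a plain numeric comparison works.

```python
from __future__ import annotations


def split_uid(uid: str) -> tuple[str, int, int]:
    """Return (base, version, number) from '<basename>:<gameversion>-<number>'."""
    base, rest = uid.split(":", 1)
    version, number = rest.split("-", 1)
    return base, int(version), int(number)


def promote(uid: str, new_version: int) -> str:
    """Give a game a new version while keeping its base and <number>."""
    base, version, number = split_uid(uid)
    if new_version <= version:
        raise ValueError("a promoted UID must carry a higher version")
    return f"{base}:{new_version:02d}-{number}"


def upgrade_set(uids: list[str], have: int, want: int) -> list[str]:
    """UIDs a user at release `have` needs in order to reach release `want`."""
    return [u for u in uids if have < split_uid(u)[1] <= want]


db = ["CentriScid:12-1", "CentriScid:13-5", "CentriScid:15-2", "CentriScid:17-4"]
db[0] = promote(db[0], 17)                  # game 1 corrected for release 17
needed = upgrade_set(db, have=14, want=17)  # games 1, 2 and 4, but not 5
```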

    In the following block "indexing" refers to "assigning
    keywords or flags or whatever". It is not the technical
    generation of a database index. (The German word I have
    in mind would be "Erschliessung"; unfortunately
    "indexing" has two meanings in English.)

    I also suggest indexing the DB intellectually, i.e.
    setting keywords wherever the community feels it necessary
    to point the user to a very nice game, an interesting
    variation and so on. I'd leave the depth at which this is
    done to those who do the indexing. I.e. I don't feel it is
    necessary that every game is analysed with the assistance
    of a GM. Some do a deeper indexing, others just check the
    formal things as they have no time for in-depth checks at
    the moment. The formal stuff should be done thoroughly,
    and the target should be not to touch the header once
    a game reaches stage 3. Beyond that, I suggest that it is
    up to the contributors. I'd also encourage pointers to
    the literature, that is a [Ref "Author: Book, page,
    number"] style PGN tag, if the game is commented somewhere
    and the reference is known. I could provide some of those
    in case of interest.

    I'd make CentriScid open to new tags added by the
    community of users who are not actively working on
    CentriScid. But this _requires_ UIDs, so that you know
    exactly which game to add the info provided from outside
    to. This may include the need for versioning, as a
    combination might be found that exists only because of an
    error in V:12 which is fixed in the current V:17. One
    could just send in a mail of the form: "I found ... in
    <UID>, please add these tags and that solution". For
    combinations at a certain stage, Pascal's suggestion above
    could be used to add the necessary data.

    I suggest making extensive use of flags for this
    indexing, for fast search, plus the use of predefined PGN
    header fields which point to the interesting part by
    normalised keywords. See above, Pascal's suggestion is
    pretty close to what I was talking about all the time.

    I suggest building a _simple_ list of keyword terms, not a
    complex set. But it should be clear from this list which
    is the broader and which the narrower term. Indexing
    should always use the narrowest term possible to describe
    the thing. (Example: if you have a book about scid and you
    could choose between Database and Chessdatabase, use
    Chessdatabase as it is closer to the thing.) This list
    can IMHO best be built while the DB is growing. It does
    not make much sense to invent it beforehand.
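
    Such a list needs no more structure than "term -> broader
    term". A Python sketch of the narrowest-term rule, where
    the endgame entries are made-up examples and the function
    names are invented here:

```python
from __future__ import annotations

# Term -> its broader term; top-level terms have no entry.
BROADER = {
    "Chessdatabase": "Database",
    "Rook ending": "Endgame",   # hypothetical entries beyond the example above
    "Pawn ending": "Endgame",
}


def ancestors(term: str) -> list[str]:
    """All broader terms of `term`, nearest first."""
    chain = []
    while term in BROADER:
        term = BROADER[term]
        chain.append(term)
    return chain


def narrowest(candidates: list[str]) -> list[str]:
    """Drop every candidate that is a broader term of another candidate."""
    covered = {a for c in candidates for a in ancestors(c)}
    return [c for c in candidates if c not in covered]


chosen = narrowest(["Database", "Chessdatabase"])
```

    A new term is added by appending one entry, and a search
    for a broader term can still find the game by walking up
    with ancestors().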

    I suggest generating training DBs out of the large DB
    as dumps of specific subsets. E.g. make a query against
    CentriScid that selects all rook endings and copy the
    result to a DB for endgame training. At this point my
    primary idea was not to place the whole game in this
    training db but just the interesting start position and
    the solution, PLUS a PGN header field containing the UID
    (see 4.) of the game itself, so the user can easily look
    it up in full. Hence, CentriScid would contain the
    _whole_ game, but you would not do your training against
    CentriScid but against a specialised partial dump that
    looks like the current training dbs.

    I suggest setting up a simple automatic query tool to
    accomplish the generation of such a partial dump. This
    does not need to be done by the community in order to set
    up the CentriScid DB.
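
    Roughly what such a query tool would do, as a Python
    sketch: filter by flag first, then by keyword, and write
    only position, solution and UID into the dump. The flag
    value, the dict layout and the field names ("RefUID" and
    so on) are assumptions, and the sample data are
    placeholders rather than real games.

```python
from __future__ import annotations

IDX_FLAG_ENDGAME = 1 << 1   # hypothetical bit value


def training_dump(games: list[dict], flag: int, keyword: str) -> list[dict]:
    """Build training entries from all games carrying `flag` and `keyword`."""
    dump = []
    for game in games:
        if not game["flags"] & flag:       # fast flag prefilter first
            continue
        if keyword not in game["keywords"]:
            continue
        dump.append({
            "FEN": game["key_position"],   # start position of the exercise
            "solution": game["solution"],
            "RefUID": game["uid"],         # pointer back to the whole game
        })
    return dump


games = [
    {"uid": "CentriScid:17-1", "flags": IDX_FLAG_ENDGAME,
     "keywords": ["Rook ending"], "key_position": "<FEN>", "solution": "<moves>"},
    {"uid": "CentriScid:17-2", "flags": IDX_FLAG_ENDGAME,
     "keywords": ["Pawn ending"], "key_position": "<FEN>", "solution": "<moves>"},
    {"uid": "CentriScid:17-3", "flags": 0,
     "keywords": [], "key_position": "", "solution": ""},
]
rook_training = training_dump(games, IDX_FLAG_ENDGAME, "Rook ending")
```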


-- 

Kind regards,                /                 War is Peace.
                             |            Freedom is Slavery.
Alexander Wagner            |         Ignorance is Strength.
                             |
                             | Theory     : G. Orwell, "1984"
                            /  In practice:   USA, since 2001
