I think if you start calculating the Zobrist hashes and scraping features yourself, you will end up with a never-ending variety of datasets.

I would prefer datasets of whole, high-quality games without SGF errors, perhaps cleaned of identifying information. Parsing an SGF is already trivial. I personally divide them by:

- Handicap used or not
- Normal komi (5.5 to 7.5) or not; this disqualifies some older games
- Rules used
- Board size
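
To illustrate the split above, here is a minimal sketch in Python. It assumes the games carry the standard SGF root properties HA (handicap), KM (komi), RU (rules) and SZ (board size). The names root_properties and classify are made up for this example, and the defaults used for missing properties (no handicap, unknown komi, 19x19 board) are my own assumptions rather than anything agreed on for the dataset.

```python
import re

# A property identifier (e.g. KM) and one bracketed value (e.g. [6.5]).
# Escaped characters such as "\]" inside a value are tolerated.
_IDENT = re.compile(r"\s*([A-Z]+)")
_VALUE = re.compile(r"\s*\[((?:\\.|[^\]\\])*)\]", re.S)


def root_properties(sgf_text):
    """Return the properties of the first (root) node as {ident: [values]}."""
    props = {}
    pos = sgf_text.find(";") + 1
    if pos == 0:  # no node at all
        return props
    while True:
        ident = _IDENT.match(sgf_text, pos)
        if not ident:  # reached the next node or the end of the game tree
            break
        pos = ident.end()
        values = []
        while True:  # a property may hold several values, e.g. AB[dd][pp]
            value = _VALUE.match(sgf_text, pos)
            if not value:
                break
            values.append(value.group(1))
            pos = value.end()
        if not values:
            break
        props[ident.group(1)] = values
    return props


def classify(sgf_text):
    """Bucket one game by handicap, komi, rules and board size."""
    props = root_properties(sgf_text)

    def first(key, default):
        return props.get(key, [default])[0].strip()

    try:
        handicap = int(first("HA", "0"))  # assumption: missing HA means no handicap
    except ValueError:
        handicap = 0
    try:
        komi = float(first("KM", "nan"))  # assumption: missing KM counts as "not normal"
    except ValueError:
        komi = float("nan")

    return {
        "handicap": handicap > 0,
        "normal_komi": 5.5 <= komi <= 7.5,
        "rules": first("RU", "unknown").lower(),
        "board_size": first("SZ", "19"),  # assumption: missing SZ means 19x19
    }
```

Calling classify on the text of each file and grouping by the returned values gives the buckets listed above. Games with malformed SGF would still need a real parser to detect.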

Following the idea of having more information instead of very specific features already extracted, it would be interesting to also have the playing times, although I don't know where you'd get that from.

You'd be an angel if you could provide a large dataset of matches with Chinese rules, especially on board sizes other than 19x19.

It would of course also have to be completely free for any use. I personally only use the KGS 6d+ games and a collection of 70k pro games whose origin I don't know. The GoGoD is proprietary. :)

Gonçalo F.

On 11/13/2015 08:39 AM, Josef Moudrik wrote:
Hello List,

There has been some debate in science about making the research more
reproducible and open. Recently, I have been thinking about making a
standard public fixed dataset of Go games, mainly to ease comparison of
different methods, to make results more reproducible and maybe free the
authors of the burden of composing a dataset. I think that the current
practice can be improved a lot.

Since the success of this endeavor crucially depends on how many authors
use the dataset, I would like to ask You (potential authors) a few questions:

1) Would this be welcomed and used? Would You personally use it? (Am I not
reinventing the wheel?)

2) What parameters should the dataset have? The number of dataset variants
(if any) should be in my opinion kept at bare minimum to reduce

2a) Size: My current view is that at least 2 sizes are necessary: small
(1000-2000 games?) and large dataset (50000-60000 games).
2b) Strength & year span: Currently I am thinking about including modern
professional games only (1970-2015)

3) Do you have any other comments, requirements for the dataset and ideas?

Thanks for Your attention,
Kind regards
Josef Moudrik

