Michel Weimerskirch wrote:
> Hi
>
> Personal dictionaries (stored in "standard.dic") seem to be in a
> binary format. Is there a tool that can convert them to text-files in
> order to process them?
>
> A large Luxembourg-based company is interested in deploying OOo on a
> number of machines in order to use the Luxembourgish spellchecking
> dictionary I developed. They offered to regularly send me their
> personal dictionaries with the words that the spellchecker doesn't
> recognise yet, so I need a means to streamline the process of
> converting and analysing those.
>
Michel:
I use this simple C program, which does the trick.
It seems the first 11 bytes are some kind of header, probably stating the
language locale and so on. After that, a 2-byte word indicates how long
the next word is (the program treats this as a byte count), followed by
the word itself in UTF-8. This count-word structure repeats for each word
in the dictionary.
Here's the simple C code:
/*
 * extraer.c: Extracts the word list from an OpenOffice.org
 * personal dictionary (.dic).
 *
 * To compile the program, run: "gcc -o extraer extraer.c"
 *
 * Usage: "extraer < fichero.dic > listado.txt"
 *
 * (c) 2005, Santiago Bosio.
 * This program is distributed under the GNU GPL license.
 *
 */
#include <stdio.h>
#include <stdlib.h>
int main (int argc, char *argv[])
{
    unsigned short largo = 0;   /* 2-byte length, read in host byte order */
    unsigned char palabra[100];
    /* Skip the first eleven bytes: header */
    if ( fread (palabra, sizeof(unsigned char), 11, stdin) < 11 )
    {
        fprintf (stderr, "Error: not a valid dictionary.\n");
        exit (1);
    }
    if ( fread (&largo, 2, 1, stdin) != 1 )
    {
        fprintf (stderr, "The dictionary contains no words.\n");
        exit (1);
    }
    do
    {
        if ( largo > sizeof(palabra) ) /* Skip long words (errors) */
        {
            fprintf (stderr, "Error: word too long.\n");
            /* Read and discard: fseek fails when stdin is a pipe */
            while ( largo-- > 0 && getchar () != EOF )
                ;
        }
        else
        {
            if ( fread (palabra, sizeof(unsigned char), (size_t) largo, stdin)
                 < (size_t) largo )
                break; /* Truncated file: stop instead of printing garbage */
            fwrite (palabra, sizeof(unsigned char), (size_t) largo, stdout);
            fprintf (stdout, "\n");
        }
    }
    while ( fread (&largo, 2, 1, stdin) == 1 ); /* Stop at end of file */
    return (0);
}
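One possible refinement, only as a sketch: the program reads the 2-byte
length straight into a native integer, so the result depends on the byte
order of the machine it runs on. Assuming the length is stored low byte
first (that is how it behaves on x86; I have not checked it against any
format documentation), it could be decoded byte by byte instead. The names
decode_len_le and read_len_le are made up for this example and are not
part of the program above.

```c
#include <assert.h>
#include <stdio.h>

/* Combine two bytes into a length, ASSUMING little-endian order
 * (low byte first). This is an assumption, not a documented fact. */
unsigned decode_len_le (unsigned char lo, unsigned char hi)
{
    return (unsigned) lo | ((unsigned) hi << 8);
}

/* Read one 2-byte length from 'in'.
 * Returns 1 on success, 0 on end of file or a truncated length. */
int read_len_le (FILE *in, unsigned *len)
{
    unsigned char b[2];

    assert (in != NULL && len != NULL);
    if ( fread (b, 1, 2, in) != 2 )
        return 0;
    *len = decode_len_le (b[0], b[1]);
    return 1;
}
```

In the main loop, each "fread (&largo, 2, 1, stdin)" would then become
"read_len_le (stdin, &largo)", with largo declared as unsigned.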
Hope this helps you. Best regards,
Santiago.