Hi,

You'd also need to convert these documents to plain text in order to process them. You could spawn OOo in batch mode for the conversion, but the easier option is to use unzip in a script and take only content.xml from each file. Then process the files with awk (define the field separator just as you would define a word boundary) and filter out all tokens that match <[a-z]+>. That should strip all the XML from the files.
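A minimal sketch of that unzip approach, assuming the .odt files sit in a directory called odt (the directory and file names here are only examples, not part of the original mail):

```shell
#!/bin/sh
# Sketch only: extract content.xml from each .odt (a zip container),
# strip the XML tags, and emit one word per line.
mkdir -p txt
for f in odt/*.odt; do
  base=$(basename "$f" .odt)
  unzip -p "$f" content.xml \
    | sed 's/<[^>]*>/ /g' \
    | tr -s ' \t' '\n' \
    | sed '/^$/d' \
    > "txt/$base.txt"
done
```

Here the sed call removes whole tags rather than filtering individual <[a-z]+> tokens afterwards, which comes to the same thing when the tags carry attributes.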

Regards
Marcin

ge writes:
Hello,

I have done word collection several times using different (web) sources.

I use Linux, but these tools are also available
for Windows as GNU tools.

I used awk, like:
 awk '{ for (i = 1; i <= NF; i++) print $i }' infile > outfile
(note it is NF, the number of fields, not $NF, the last field)

This prints each word in a single line.

Then I sorted the file using sort < infile > outfile,
and then used further awk scripts to get rid of word endings;
this is probably much easier for Danish than for Hungarian.
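Those two steps might look something like this; the suffix rule is purely illustrative (a hypothetical plural ending "er"), since the real ending-stripping rules depend on the language:

```shell
# One word per line in words.txt -> sorted, de-duplicated list
sort -u words.txt > sorted.txt

# Illustrative only: strip a hypothetical "er" ending, then de-duplicate again
awk '{ sub(/er$/, ""); print }' sorted.txt | sort -u > stems.txt
```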

Good luck! Eleonora


[lingu-dev] Help needed - bulk extraction of words

Hi all,
The Danish project has been fortunate enough to receive a bunch of articles from a news magazine. These are odt files, and we would like to extract the words from these documents. We have programs for this purpose, but we usually get donations one document at a time. This time we have several thousand documents, and I believe it would take about a year to load them one by one.

Do any of you have a program that can extract words from several documents?

The words will be loaded into our workflow for linguistic processing and in the end become part of the Danish spelling dictionary.

Thanks in advance.


