Hi Andi,
I just created a really ugly shell script to split Project Gutenberg files
into little bite-size pieces. It's user-unfriendly, but it works.
--linas
#! /bin/bash
# Split big project-gutenberg files into parts.
# takes two arguments: the first argument is the filename to split,
# the second is the prefix for the filenames to generate.
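# take file in argument 1, and replace each blank line (paragraph break)
# by the control-K character.
# First, join all lines, marking each newline with a placeholder.
sed ':a;N;$!ba;s/\n/xxx-foo-xxx/g' "$1" > xxx
# A blank line (newline, CR, newline) becomes control-K on its own line.
sed 's/xxx-foo-xxx\rxxx-foo-xxx/\n\x0b\n/g' xxx > yyy
# Each remaining CR-newline pair becomes a plain newline.
sed 's/\rxxx-foo-xxx/\n/g' yyy > zzz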
# split the file along control-K into parts with 50 paragraphs each.
# split -t ' ' zzz poop-
split -l 50 -t $'\x0b' --filter='sed "s/\x0b//g" > "$FILE"' zzz "$2"
# remove temps
rm xxx yyy zzz
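
For comparison, the same kind of split can be sketched without temp files
by using awk's paragraph mode. This is only a rough sketch, not what the
script above does: the function name split_paragraphs, the three-digit
numeric suffix, and the default of 50 paragraphs per file are all
assumptions made for illustration. It assumes paragraphs are separated by
blank lines, and it strips carriage returns first so that CRLF blank lines
count as empty.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original script):
#   split_paragraphs INPUT PREFIX [PARAGRAPHS_PER_FILE]
# Writes PREFIX000, PREFIX001, ... each holding up to N paragraphs.
split_paragraphs() {
    local input=$1 prefix=$2 per_file=${3:-50}

    # Remove carriage returns so that CRLF blank lines are truly empty.
    tr -d '\r' < "$input" |
    awk -v prefix="$prefix" -v n="$per_file" '
        # RS="" is paragraph mode: records are separated by blank lines.
        BEGIN { RS = ""; part = 0 }

        # Start a new output file every n paragraphs.
        (NR - 1) % n == 0 {
            if (out != "") close(out)
            out = sprintf("%s%03d", prefix, part++)
        }

        # Write the paragraph, followed by one empty line.
        { printf "%s\n\n", $0 > out }
    '
}
```

After sourcing the function, something like
split_paragraphs pg-book.txt book-part- 50 would write book-part-000,
book-part-001, and so on; the file names here are made up.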
On Thu, May 11, 2017 at 5:28 PM, Linas Vepstas <[email protected]>
wrote:
> Hi Andi,
>
> Yeah, that's ideal. Did you do this with a script, or by hand? In my
> ideal world, there's some script that downloads a bunch of these from
> Project Gutenberg, strips out the license boilerplate, and puts them into
> some directory. Busting them up into chapters would be nice, too, so that
> if cogserver chokes and dies, or I have to kill it, it can pick up where it
> left off, more or less.
>
>
> On Thu, May 11, 2017 at 4:48 AM, Andi <[email protected]> wrote:
>
>> does something like this help?
>>
>>
>>> It would really really help if someone could find & prepare some clean
>>> text of some kind of adventure novels or young-adult lit, or any kind of
>>> narrative literature. Maybe from project gutenberg. I've discovered that
>>> wikipedia has 3 major faults:
>>>
>>>
>