Warning: this is probably more detail than anyone will ever want to
know, but here it is.
Roland Mainz writes:
> "R. Bernstein" wrote:
> >
> > I'd like to read a potentially large file (a zsh script file) into an
> > array fast. It is not uncommon for a GNU autoconf configure script to
> > be tens of thousands of lines long. (As an example, the zsh configure
> > script is over 20,000 lines long.)
> >
> > I know about redirecting input in a loop and/or alternatively using
> > "read" in a loop. But for a large file this tends to be slow. As a
> > result, both bash and zsh have a mechanism for doing this faster than
> > using a read loop.
>
> Do you have any examples for the "bash" and "zsh" syntax?
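For contrast, the slow read-loop approach mentioned above looks something like this in bash (a sketch, not from the original message; the file contents are made up for illustration):

```shell
#!/usr/bin/env bash
# The straightforward (slow) approach: one "read" builtin call per line.
tmp=$(mktemp)
printf 'alpha\nbeta\ngamma\n' > "$tmp"

lines=()
while IFS= read -r line; do   # IFS= and -r preserve whitespace and backslashes
    lines+=("$line")
done < "$tmp"

echo "${#lines[@]}"           # number of lines read
echo "${lines[1]}"            # second line (bash arrays are 0-origin)
rm -f "$tmp"
```

Each iteration is a separate builtin invocation plus an array append, which is why this gets slow for files tens of thousands of lines long.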
Zsh's is probably the simplest, although even it is somewhat convoluted:
zmodload -ap zsh/mapfile mapfile
file_text=( "${(f@)mapfile[/path/to/file]}" )
The double quotes are necessary; (f) splits the string on newlines, and
@ (together with the double quotes) preserves any empty elements.
mapfile reads the file in as one long string with embedded newlines.
If the final line has a newline then there is an empty string at the
end of the array, else not.
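Put together, a self-contained sketch of that recipe (using a throwaway temp file; this assumes the zsh/mapfile module is available, as it is in stock zsh):

```shell
#!/usr/bin/env zsh
# Sketch of the zsh mapfile recipe; file contents are made up for illustration.
tmp=$(mktemp)
printf 'alpha\nbeta\ngamma\n' > "$tmp"

zmodload zsh/mapfile                     # makes the $mapfile parameter available
file_text=( "${(f@)mapfile[$tmp]}" )     # split whole-file string on newlines

print ${#file_text}     # 4: the trailing newline yields a final empty element
print $file_text[2]     # beta (zsh arrays are 1-origin)
rm -f "$tmp"
```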
I should have mentioned that storing the text as a long string is okay
provided there is some interface for accessing pieces of the string by
line. Think, for example, of Python's "linecache" module:
http://docs.python.org/lib/module-linecache.html
Since zsh doesn't have as rich a data-structure mechanism as ksh, what
I'm actually using is a bit uglier, so as to be able to store an
associative array of arrays. The gory details are:
http://github.com/rocky/zshdb/tree/3e1b9229edbf68913a0a1d77816cad05a085e830/lib/filecache.sh
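One common workaround for the missing array-of-arrays type (a sketch of the general idea, not necessarily exactly what filecache.sh does; all names here are illustrative) is to flatten everything into a single associative array keyed by "filename:lineno":

```shell
#!/usr/bin/env zsh
# Simulate an associative array of arrays with composite keys.
typeset -A line_cache

cache_file() {
    local f=$1 i=1 line
    # $(<file) reads the whole file; (f@) splits it on newlines.
    for line in "${(f@)$(<$f)}"; do
        line_cache[${f}:${i}]=$line
        (( i++ ))
    done
}

tmp=$(mktemp)
printf 'one\ntwo\n' > "$tmp"
cache_file "$tmp"
print $line_cache[${tmp}:2]    # two
rm -f "$tmp"
```

Lookup by file and line number is then a single hash access, which is what a linecache-style interface needs.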
As for bash, that's kind of ugly too. I wrote a custom C module for
that for bashdb. Here is the docstring:
readarray: readarray [-u fd] [-n count] [-O origin] [-t] [-C callback] [-c
quantum] array_variable
Multiple lines are read from the standard input into ARRAY_VARIABLE,
or from file descriptor FD if the -u option is supplied.
Use the `-n' option to specify COUNT number of lines to copy.
If -n is missing or COUNT is 0, all lines are copied.
Use the `-O' option to specify an index ORIGIN to start the array.
If -O is missing the origin will be 0.
Use -t to chop trailing newlines (\n) from lines.
Use the `-C' option to specify CALLBACK, which is evaluated each time
QUANTUM lines are read. QUANTUM is specified via the -c option;
5000 is the default.
The -C and -c options may be useful in implementing a progress bar.
Note: this routine does not clear any previously existing array values.
It will however overwrite existing indices.
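bash 4 ships essentially this interface as the builtin mapfile (with readarray as a synonym). A hedged sketch of the -C/-c progress-callback mechanism the docstring describes, using the stock bash 4 builtin and made-up file contents:

```shell
#!/usr/bin/env bash
# Requires bash >= 4 for the mapfile builtin.
tmp=$(mktemp)
printf 'l1\nl2\nl3\nl4\n' > "$tmp"

# Invoked every QUANTUM lines; bash passes the next array index in $1
# and the just-read line in $2.
progress() {
    echo "progress: about to assign index $1" >&2
}

# -t chops trailing newlines; -C/-c give a callback every 2 lines.
mapfile -t -C progress -c 2 source_array < "$tmp"

echo "${#source_array[@]}"    # 4
rm -f "$tmp"
```

A progress bar would just redraw itself inside the callback instead of echoing to stderr.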
So bashdb does something like this:
enable -f /path/to/readarray readarray
builtin readarray -t -O 1 -c 30000 source_array < /path/to/file
Again things are uglier than this for reasons given above, and
although I won't link here you could look at the bashdb sources.
Chet Ramey has adopted this, and presumably adapted it, for bash 4.0,
where it is (probably erroneously) called mapfile, apparently to be
reminiscent of zsh, even though the two work a bit differently. (I may
have been in favor of the name before I realized how different the
interfaces were.)
> > So what's the fastest way to read a large file into an array?
>
> How should the file be read into the array? One array element per line?
Yep. Blank lines are null strings. And there should be some way to
figure out whether there is a newline on the end or not; it could be,
for example, that the newline is kept at the end of the string. Or
again, a Python-like line-cache interface would do if the text is
saved as one long character string.
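One portable way to answer the "was there a final newline?" question (my own sketch, not from the exchange above): check whether the file's last byte is a newline, exploiting the fact that command substitution strips a trailing newline.

```shell
#!/usr/bin/env bash
# Detect whether a file ends with a newline.
tmp=$(mktemp)
printf 'ends without newline' > "$tmp"

# tail -c 1 prints the last byte; $(...) strips a trailing newline,
# so the result is empty exactly when that byte was a newline.
if [ -s "$tmp" ] && [ -z "$(tail -c 1 "$tmp")" ]; then
    ends_with_newline=yes
else
    ends_with_newline=no
fi

echo "$ends_with_newline"    # no
rm -f "$tmp"
```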
_______________________________________________
ast-users mailing list
[email protected]
https://mailman.research.att.com/mailman/listinfo/ast-users