Hi,
I have been looking at a mainly shell-script based repository with OpenGrok.
Most of the scripts do NOT end with the common ".sh" suffix and are not directly
executable (so they lack the customary "#!" magic), but they all have an
Emacs mode line at the top of the file, for example: "# -*- shell-script -*-".
Many scripting type languages have useful first lines, but the current code
base sticks to the magic(1) convention of only reading 4-8 bytes to determine a
file type. The code which does this comes from AnalyzerGuru.java and starts as
follows:
    public static FileAnalyzerFactory find(InputStream in) throws IOException {
        in.mark(8);
        byte[] content = new byte[8];
        int len = in.read(content);
        in.reset();
My question is: If we are going to read the file (actually InputStream) for 8
bytes, would it make sense to read a few more (say up to 64 or 128) and be able
to use other heuristics to determine the file type?
This could be used to determine the interpreter in use, by parsing lines like
"#!/path/to/interpreter". Or we could look for editor and other hints like
Emacs' mode lines "-*- <ModeName> -*-", and I'm sure others will think of other
heuristics as we go. But we need the extra characters to do this.
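
To make the two heuristics above concrete, here is a rough sketch of how the
first line might be parsed. The class name HeaderHints and the method guess()
are made up for illustration and are not part of OpenGrok. It assumes the
caller has already read the leading bytes of the file, and it only looks at
the first line (Emacs also accepts a mode line on the second line after a
"#!" line, which this sketch ignores).

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not an OpenGrok class.
final class HeaderHints {
    // Captures whatever sits between the two "-*-" markers of an Emacs mode line.
    private static final Pattern EMACS = Pattern.compile("-\\*-(.*?)-\\*-");

    /** Returns a hint such as "sh" or "shell-script", or null if none is found. */
    static String guess(byte[] head, int len) {
        // ISO-8859-1 maps every byte to exactly one char, so decoding cannot fail.
        String text = new String(head, 0, len, StandardCharsets.ISO_8859_1);
        int eol = text.indexOf('\n');
        String first = (eol >= 0 ? text.substring(0, eol) : text).trim();

        if (first.startsWith("#!")) {
            String[] words = first.substring(2).trim().split("\\s+");
            String name = words[0].substring(words[0].lastIndexOf('/') + 1);
            // "#!/usr/bin/env python": the interpreter is the next word.
            if (name.equals("env") && words.length > 1) {
                name = words[1];
            }
            return name.isEmpty() ? null : name;
        }

        Matcher m = EMACS.matcher(first);
        if (!m.find()) {
            return null;
        }
        String body = m.group(1).trim();
        if (!body.contains(":")) {
            // Short form: "-*- shell-script -*-".
            return body.isEmpty() ? null : body;
        }
        // Long form: "-*- mode: C; tab-width: 4 -*-".
        for (String var : body.split(";")) {
            String[] kv = var.split(":", 2);
            if (kv.length == 2 && kv[0].trim().equalsIgnoreCase("mode")) {
                return kv[1].trim();
            }
        }
        return null;
    }
}
```

The returned string would still have to be mapped to an analyzer (for example
"sh", "bash" and "shell-script" to whatever handles shell scripts); that
mapping is left out here.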
The principal issue is that InputStreams don't have to support mark()/reset(),
and don't by default; the code handles that IOException. Direct file I/O is
probably fine as the OS should support this without issue, but I wonder whether
the code that extracts files from the growing base of repositories OpenGrok
supports would handle this. I'm guessing yes, but what do others think?
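
One way to sidestep the mark()/reset() question, sketched below, is to wrap
any stream that reports markSupported() == false in a BufferedInputStream
before peeking. As far as I know a plain FileInputStream also reports false,
so it would get wrapped too. The names StreamPeek, markable(), peek() and the
128-byte limit are assumptions for illustration, not existing OpenGrok code.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical helper for illustration only; not an OpenGrok class.
final class StreamPeek {
    // Assumed upper bound on how many leading bytes the heuristics need.
    static final int PEEK_LIMIT = 128;

    /** Returns a stream that supports mark()/reset(), wrapping only if needed. */
    static InputStream markable(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in, PEEK_LIMIT);
    }

    /** Reads up to PEEK_LIMIT bytes, then rewinds the stream to where it was. */
    static byte[] peek(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("call markable() first");
        }
        in.mark(PEEK_LIMIT);
        byte[] buf = new byte[PEEK_LIMIT];
        int total = 0;
        try {
            // A single read() may return fewer bytes than requested, so loop.
            while (total < buf.length) {
                int n = in.read(buf, total, buf.length - total);
                if (n < 0) {
                    break; // end of stream before PEEK_LIMIT bytes
                }
                total += n;
            }
        } finally {
            in.reset();
        }
        return Arrays.copyOf(buf, total);
    }
}
```

The caller has to keep using the stream returned by markable() afterwards,
since the wrapper now holds the bytes that were read ahead.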
Is this more generalized solution worth pursuing? Thoughts/comments welcome!
Regards,
Peter Bray
Sydney, Australia
This message posted from opensolaris.org