Hi,

I have been looking at a mainly shell-script-based repository with OpenGrok.
Most of the scripts do not end with the common ".sh" suffix and are not
directly executable (so they lack the customary "#!" magic), but they all have
an Emacs mode line at the top of the file, for example "# -*- shell-script -*-".

Many scripting languages have informative first lines, but the current code
base sticks to the magic(1) convention of reading only 4-8 bytes to determine a
file type. The code that does this is in AnalyzerGuru.java and starts as
follows:

    public static FileAnalyzerFactory find(InputStream in) throws IOException {
        in.mark(8);
        byte[] content = new byte[8];
        int len = in.read(content);
        in.reset();

My question is: If we are going to read the file (actually InputStream) for 8 
bytes, would it make sense to read a few more (say up to 64 or 128) and be able 
to use other heuristics to determine the file type?

This could be used to determine the interpreter in use, by parsing lines like
"#!/path/to/interpreter". Or we could look for editor and other hints, such as
Emacs mode lines ("-*- <ModeName> -*-"), and I'm sure others will think of more
heuristics as we go. But we need the extra characters to do this.
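
As a rough sketch of the two heuristics mentioned above (not OpenGrok code: the
class name HeaderSniffer, its method names, and the 128-byte limit are all
hypothetical), something along these lines could read a longer header and pull
out an interpreter name or an Emacs mode name. It assumes the stream supports
mark()/reset(), and it searches the whole header for a mode line rather than
only the first line or two.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HeaderSniffer {
    // Hypothetical limit; the proposal is "say up to 64 or 128" bytes.
    static final int HEADER_LIMIT = 128;

    // "#!" followed by a command and, optionally, its first argument.
    private static final Pattern SHEBANG =
            Pattern.compile("^#!\\s*(\\S+)(?:\\s+(\\S+))?");
    // Text enclosed by "-*-" ... "-*-" on a single line.
    private static final Pattern MODE_LINE =
            Pattern.compile("-\\*-\\s*(.*?)\\s*-\\*-");

    // Reads up to HEADER_LIMIT bytes, then rewinds the stream.
    // Assumes in.markSupported() is true.
    static String readHeader(InputStream in) throws IOException {
        in.mark(HEADER_LIMIT);
        byte[] buf = new byte[HEADER_LIMIT];
        int total = 0;
        int n;
        while (total < buf.length
                && (n = in.read(buf, total, buf.length - total)) > 0) {
            total += n;
        }
        in.reset();
        // ISO-8859-1 maps every byte to one char, so no decoding errors.
        return new String(buf, 0, total, StandardCharsets.ISO_8859_1);
    }

    // Returns the interpreter's base name from a "#!" first line, or null.
    static String interpreter(String header) {
        int eol = header.indexOf('\n');
        String first = eol < 0 ? header : header.substring(0, eol);
        Matcher m = SHEBANG.matcher(first);
        if (!m.find()) {
            return null;
        }
        String name = baseName(m.group(1));
        // "#!/usr/bin/env python" names the interpreter in the argument.
        if (name.equals("env") && m.group(2) != null) {
            name = baseName(m.group(2));
        }
        return name;
    }

    // Returns the mode named by an Emacs "-*- ... -*-" line, or null.
    static String emacsMode(String header) {
        Matcher m = MODE_LINE.matcher(header);
        if (!m.find()) {
            return null;
        }
        String body = m.group(1);
        if (!body.contains(":")) {
            // Short form: "-*- shell-script -*-"
            return body.isEmpty() ? null : body;
        }
        // Long form: "-*- mode: sh; coding: utf-8 -*-"
        for (String part : body.split(";")) {
            String[] kv = part.split(":", 2);
            if (kv.length == 2 && kv[0].trim().equalsIgnoreCase("mode")) {
                return kv[1].trim();
            }
        }
        return null;
    }

    private static String baseName(String path) {
        return path.substring(path.lastIndexOf('/') + 1);
    }
}
```

The returned names would still have to be mapped to an analyzer somewhere; that
mapping is left out here.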

The principal issue is that InputStreams don't have to support mark()/reset(),
and don't by default; the code handles the resulting IOException. Direct file
I/O is probably fine, as the OS should support this without issue, but I wonder
whether the code that extracts files from the growing set of repository types
OpenGrok supports would handle this. I'm guessing yes, but what do others think?
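
One way to sidestep the question, sketched here under the assumption that
callers can switch to a replacement stream (the class and method names are
hypothetical, not part of OpenGrok), is to wrap any stream that reports
markSupported() == false in a java.io.BufferedInputStream, which implements
mark()/reset() itself. Note that a plain java.io.FileInputStream also reports
false, so it would take the wrapping path too.

```java
import java.io.BufferedInputStream;
import java.io.InputStream;

public class MarkableStreams {
    // Returns the stream itself if it already supports mark()/reset(),
    // otherwise a BufferedInputStream wrapper that does.
    static InputStream ensureMarkable(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }
}
```

After wrapping, all further reads have to go through the returned stream, since
the wrapper may already have buffered bytes from the original.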

Is this more general solution worth pursuing? Thoughts and comments welcome!

Regards,

Peter Bray
Sydney, Australia
 
 
This message posted from opensolaris.org
