Re: [Jprogramming] file read

Raul Miller Thu, 08 Nov 2012 07:18:38 -0800

(responding inline because that makes sense to me, here.)

On Wed, Nov 7, 2012 at 11:53 PM, Toshinari Kamakura wrote:
> Thank you for quick responding to my mail.


That was pure luck -- I just happened to be checking my email shortly
after your previous post arrived.

> My desire is just to handle the missing values as R language can do.
>
>> u=read.table("mytestfile.txt",header=F,na.strings="",sep="\t")
>> u
>     V1 V2 V3 V4
> 1    a  1  2  3
> 2    b NA  5  6
> 3    c  7 NA  9
> 4    d 10 11 12
> 5 <NA> 13 14 15
>
> The R system inputs NA values for missing numerical values or <NA> values
> for character values.
>
>
> It seems to me that J can read text file much faster than R especially for
> very large data.

Speed is mostly an issue of what is not being done.  In this case,
some part of R's slower speed comes from R's file format.  R's
implementation of NA values requires information which is not local to
where the NA values appear.

Further, as Brian Schott has pointed out, J does not have an NA value.
That said, J has four numeric values which are plausible stand-ins
for the numeric case:

_       infinity
__      negative infinity
0       zero
_.      indeterminate (typically yields errors in computations)
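
As a small illustration of how such a placeholder behaves (the name v
and its contents are made up for this sketch, not taken from your
file): an infinity propagates through a sum, so it has to be removed
before computing a summary.

```j
   v =: 7 __ 9          NB. hypothetical column, __ marks a missing entry
   +/ v                 NB. the placeholder swamps a plain sum
__
   v -. __              NB. drop the placeholder first
7 9
   (+/ % #) v -. __     NB. mean of the values that are present
8
```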


For the textual case, a blank is probably the best choice, but the
literal string '<NA>' is of course allowed.

Anyways, when I look at
http://cran.r-project.org/doc/manuals/R-data.html#Spreadsheet_002dlike-data
I see that R has a lot of file format reading features -- an
implementation in J which supported all of those features could easily
be as slow as R, or slower.

That said, here are a few techniques that you might want to use:

Tabular data in J, with typed columns, can be efficiently represented
as a list of equal-length columns (where each column is represented by
a list -- either in a box or in a variable).
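
A minimal sketch of that layout, using made-up names (names, v2, tbl)
and made-up data rather than anything read from a file:

```j
   names =: 'abcd'              NB. hypothetical character column
   v2    =: 1 __ 7 10           NB. hypothetical numeric column, __ for missing
   tbl   =: names ; v2          NB. the table: a list of boxed columns
   # tbl                        NB. number of columns
2
   #&> tbl                      NB. length of each column
4 4
   >./ (> 1 { tbl) -. __        NB. largest value present in column 1
10
```

Each column keeps its own type, so numeric columns stay unboxed
internally and arithmetic on them runs at full speed.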

You can infer the type of a column by removing blank values and
testing the first remaining entry.
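
A sketch of that test, assuming a boxed column of text cells (col and
isnum are hypothetical names for this example).  The idea is to
convert the cell twice with different defaults; the results agree
only when the text really is a number.

```j
   col   =: '' ; '13' ; '14'    NB. hypothetical column, first cell blank
   isnum =: 0&". -: 1&".        NB. both defaults give the same result?
   isnum > {. col -. a:         NB. test the first non-blank cell
1
   isnum 'a'                    NB. text that is not a number
0
```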

You can translate a boxed column of values to a numeric list using
dyadic ". (convert text to numbers) and > (unbox).
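
For example (col and filled are hypothetical names, and this is only
one way to arrange it): blank cells are first filled with the text
'__', so that every cell yields exactly one number, and then each
cell is converted and the results are joined into one list.

```j
   col    =: '1' ; '' ; '7' ; '10'          NB. hypothetical boxed text column
   filled =: (<'__') (I. col = a:)} col     NB. text '__' into the blank cells
   ; __&".&.> filled                        NB. convert each cell, then raze
1 __ 7 10
```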

So, here's an example implementation which only implements a subset of
R's features, which might be sufficient for your mytestfile.txt:

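rishread=:3 :0
  NB. read file, drop CRs, cut lines on the final character (LF),
  NB. cut fields at TAB, then transpose: each row of raw is a column
  raw=: |: <;._2@,&TAB;._2 -.&CR fread y
  NB. first non-blank cell of each column
  proto=: {.@-.&a:"1 raw
  NB. 1 where that cell parses as a number (defaults 0 and 1 agree)
  numeric=: (0&".@> = 1&".@>) proto
  NB. numeric column: convert each cell with dyadic ". (default __)
  numcol=: [: < {.@".~&__@>@]
  NB. text column: replace blank cells with '<NA>'
  chrcol=: [: < (<'<NA>') [`(I.@:=&a:@])`]} ]
  NB. apply chrcol or numcol to each column, as selected by numeric
  numeric chrcol`numcol@.["_1 raw
)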

Example use:

   rishread 'mytestfile.txt'

If you want this code and if an email system corrupts it, let me know
and I will post it in a different fashion.

Note that the values substituted for blanks could become parameters
(for example, by changing the definition to be a conjunction and using
m in place of __ and n in place of '<NA>').
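
A rough sketch of what that could look like for a single column.
This is not a rewrite of rishread; fillcol is a hypothetical name,
and it assumes y is one boxed column of text cells, m is the numeric
placeholder and n is the text placeholder.

```j
fillcol =: 2 : 0
  blank =. I. y = a:                    NB. positions of the blank cells
  if. (0&". -: 1&".) > {. y -. a: do.   NB. first non-blank cell is numeric?
    ; m&".&.> (< ": m) blank} y         NB. numeric list, m where blank
  else.
    (< n) blank} y                      NB. boxed text, n where blank
  end.
)

   (__ fillcol '<NA>') '1' ; '' ; '7'
1 __ 7
```

Applied to a text column such as 'a' ; '' ; 'c', the same verb would
return the boxes with '<NA>' in the blank position.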

But if file read speed is important to you, you might want to think
about whether you can use a more constrained file format.

Thanks,

-- 
Raul
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
