On Wed, 27 May 2009, Roger Bivand wrote:

On Tue, 26 May 2009, Nikos Alexandris wrote:

(Cc to Even Rouault; apologies to Even, since he is not subscribed to the
list)

Roger:
Three minutes instead of thirty+ suggests that the OGR
plugin has trouble with SQLite as the DB format. So maybe
the default for plugin= should be FALSE, rather than NULL with
automatic use of the plugin if it is present?

--%<--
Could you, Nikos,
make a script generating a similar table in spearfish, and two small
scripts exercising the problem (export to R with the plugin, and with
the temporary shapefile)?

* The "problem" also exists with the default DBF as a back-end. I
created 1000 random points, filled less than half of the records with
random numbers, and readVECT6("x", plugin=TRUE) again takes too long. I
interrupted the process after it had been running for more than 20 mins.

OK. With 250 rows and 250 columns, I see an order of magnitude saving with plugin=FALSE. In plugin=FALSE, the times are split equally between writing the temporary file from GRASS with v.out.ogr, and reading it into R with readOGR(), as one might expect (that is all readVECT6(..., plugin=FALSE) is doing). Even on a small vector (bugsites, 90 points, 2 attribute columns), plugin=FALSE is faster than plugin=TRUE by about 0.75 : 1.35, not quite twice. Which way does the problem scale, in numbers of features, numbers of attribute columns, or both?

Next step: a script in R that generates cases with increasing NR and NC through writeVECT6(), to test the plugin=FALSE/plugin=TRUE timing ratios?

And we also need to check whether the same applies to use of the plugin in other settings - I'm pretty certain this isn't just differential behaviour in readOGR() between the OGR shapefile driver and the OGR GRASS vector driver. Could someone test v.out.ogr against ogr2ogr using the plugin?
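One way to run that comparison outside R is sketched below. It is only a sketch under stated assumptions: the map name and mapset path are copied from the output shown further down in this thread and need adjusting, the `elapsed` helper is hypothetical, and ogr2ogr can only read the map if the GRASS OGR plugin is installed and the commands are run inside a GRASS session.

```shell
#!/bin/bash
# Sketch: time a GRASS-side export against an OGR-plugin read of the same map.
# MAP and MAPSET_DIR are illustrative values; adjust them to the local setup.
MAP=random_points_1000
MAPSET_DIR=/geo/grassdb/spearfish60/user1
OUT=$(mktemp -d)

# Hypothetical helper: print the wall-clock time of a command in whole seconds.
elapsed() {
    local start end
    start=$(date +%s)
    "$@" > /dev/null 2>&1
    end=$(date +%s)
    echo $(( end - start ))
}

# Only run the comparison when both tools are on the PATH.
if command -v v.out.ogr > /dev/null && command -v ogr2ogr > /dev/null; then
    # GRASS writes the shapefile itself (no OGR GRASS driver involved).
    echo "v.out.ogr: $(elapsed v.out.ogr input="${MAP}" type=point \
        dsn="${OUT}" olayer=grass_export format=ESRI_Shapefile) s"
    # OGR reads the GRASS vector through the plugin (the "head" file is the source).
    echo "ogr2ogr:   $(elapsed ogr2ogr -f "ESRI Shapefile" \
        "${OUT}/ogr_export.shp" "${MAPSET_DIR}/vector/${MAP}/head") s"
fi
```

If ogr2ogr is slow in the same way as readVECT6(..., plugin=TRUE), that would point at the OGR GRASS driver rather than at readOGR().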

Roger




* A script is pasted at the bottom; it has a small "bug" (details
below) :-)


First some results for 1000 rows by 500 columns:

system.time(random_points <- readVECT6("random_points_1000",
plugin=TRUE))
OGR data source with driver: GRASS
Source: "/geo/grassdb/spearfish60/user1/vector/random_points_1000/head",
layer: "1"
with  1000  rows and  501  columns
^C
### This was running for more than 10 hours !!! ###


system.time(random_points <- readVECT6("random_points_1000",
plugin=FALSE))
Exporting 1000 points/lines...
100%
1000 features written
OGR data source with driver: ESRI Shapefile
Source: "/geo/grassdb/spearfish60/user1/.tmp/vertical", layer:
"random_p"
with  1000  rows and  501  columns
Feature type: wkbPoint with 2 dimensions
  user  system elapsed
62.515   9.256  74.013


system.time(random_points <- read.csv("random_points_1000_table.csv"))
  user  system elapsed
 0.192   0.000   0.192



* A script to generate "some" random points, add columns and some R-code
to load with readVECT6( plugin = TRUE ), readVECT6( plugin = FALSE ) and
read.csv.

* The "bug" is that while the variable NUMBER="$[ ( $RANDOM % 100 ) +
1 ]" runs ok under the CLI, it doesn't work from within the bash
script!? So I've commented the respective line and use a fixed number
instead.
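One possible explanation, offered as an assumption rather than a diagnosis: `$[ ... ]` is an old, deprecated bash form and `$RANDOM` is a bash extension, so both can fail if the script is started with `sh script.sh` on a system where `/bin/sh` is not bash. A minimal sketch using the standard `$(( ... ))` arithmetic expansion instead, still assuming bash for `$RANDOM`:

```shell
#!/bin/bash
# Draw a random integer between 1 and 100 using $(( ... )) arithmetic expansion.
# $RANDOM is bash-specific, so start the script with bash, not plain sh.
NUMBER=$(( (RANDOM % 100) + 1 ))
echo "${NUMBER}"
```

Placed inside the inner loop of the script, this would give each updated cell its own value instead of the fixed NUMBER.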

--%<---%<---%<---%<---%<---%<---%<---%<---%<---%<---%<---%<-
#!/bin/bash

# example that  readVECT6 ( x , plugin = TRUE )  is too slow
# (also) using the default DBF driver
# first enter in spearfish60/user1

# try with a different back-end?
# db.connect driver=sqlite database=

# set numbers here:
RANDOM_POINTS=100 ; RANDOM_POINTS_CATS=100 ; NUMBER=111

# create RANDOM_POINTS random points
v.random --o output=random_points_${RANDOM_POINTS} n=${RANDOM_POINTS}

# add in database
v.db.addtable random_points_${RANDOM_POINTS}


# add ${RANDOM_POINTS} columns
# (echo -e is needed for bash to interpret "\n" as a newline)
echo -e "\n* Adding ${RANDOM_POINTS} columns"
for x in `seq 1 ${RANDOM_POINTS}` ; do
  v.db.addcol random_points_${RANDOM_POINTS} column="col_${x} integer"
done ; echo -e "\n* ${RANDOM_POINTS} columns added"


# check if columns are added
v.info -c random_points_${RANDOM_POINTS}



## WARNING: double loop below takes too long!
# --%<--
# It is simpler and faster to use a single loop with a fixed value
# instead, e.g.:
 #for COL in `seq 1 5 ${RANDOM_POINTS}` ; do
 # v.db.update random_points_${RANDOM_POINTS} column="col_${COL}" value=222
 #done
# --%<--


# fill some columns/cats with random numbers between 1 and 100
# alter sequence as desired ; more numbers = more time to load in R
for COL in `seq 1 10 ${RANDOM_POINTS}` ; do
for CAT in `seq 1 10 ${RANDOM_POINTS_CATS}` ; do
 # this is ok in the command line but NOT when running the script?
 #NUMBER="$[ ( $RANDOM % 100 ) + 1 ]"
 v.db.update random_points_${RANDOM_POINTS} column="col_${COL}" \
   value=${NUMBER} where="cat=${CAT}"
done
done


# [optional] fill in some "-999" values to use as NAs in R?
#NAN=-999
#for COL in `seq 1 5 ${RANDOM_POINTS}` ; do
# for CAT in `seq 1 5 ${RANDOM_POINTS_CATS}` ; do
#  v.db.update random_points_${RANDOM_POINTS} column="col_${COL}" \
#    value=${NAN} where="cat=${CAT}"
# done
#done

# check with v.db.select
# v.db.select random_points_${RANDOM_POINTS} | head -25

# export table as .csv file
db.out.ogr in=random_points_${RANDOM_POINTS} format=CSV \
  dsn=/geo/grassdb/spearfish60/random_points_csv_files \
  db_table=random_points_${RANDOM_POINTS}.csv

### end of bash script ###


## launch R
R
### R code

# load in R with:
library(spgrass6) ; G <- gmeta6()

#a. plugin=FALSE (export to a temporary shapefile, then readOGR)
system.time ( random_points <- readVECT6 ( "random_points_100" , plugin = FALSE ) )

#b. plugin=TRUE (read directly through the OGR GRASS driver)
system.time ( random_points <- readVECT6 ( "random_points_100" , plugin = TRUE ) )

#c. as a csv table
# adjust as required
setwd("/geo/grassdb/spearfish60/random_points_csv_files")
table_to_read <- dir ( pattern = "^random.*\\.csv$" )
system.time ( random_points <- read.csv ( table_to_read ) )
str(random_points)
--%<---%<---%<---%<---%<---%<---%<---%<---%<---%<---%<---%<-





--
Roger Bivand
Economic Geography Section, Department of Economics, Norwegian School of
Economics and Business Administration, Helleveien 30, N-5045 Bergen,
Norway. voice: +47 55 95 93 55; fax +47 55 95 95 43
