Re: [Solr Wiki] Update of "DataImportHandler" by FergusMcMenemie

Chris Hostetter Fri, 21 May 2010 12:46:40 -0700

In general please don't make changes like this -- there are still lots of 
people using SOlr 1.3 (even some using Solr 1.2), and in many cases the 
wiki is hte only (user level) documentation we have --- if a config option 
notes that it is available as of Solr1.4, that is very important to leave 
in the wiki (as people who attempt to use it may not notice that it has no 
effect in older versions)


: Date: Fri, 21 May 2010 10:14:58 -0000
: From: Apache Wiki <wikidi...@apache.org>
: Reply-To: solr-...@lucene.apache.org
: To: Apache Wiki <wikidi...@apache.org>
: Subject: [Solr Wiki] Update of "DataImportHandler" by FergusMcMenemie
: 
: Dear Wiki user,
: 
: You have subscribed to a wiki page or wiki category on "Solr Wiki" for change 
notification.
: 
: The "DataImportHandler" page has been changed by FergusMcMenemie.
: The comment on this change is: removing notes covering differences between 
solr versions 1.3 / 1.4 (we are now heading for 1.5! ).
: http://wiki.apache.org/solr/DataImportHandler?action=diff&rev1=246&rev2=247
: 
: --------------------------------------------------
: 
:   = Data Import Request Handler =
: - <!> [[Solr1.3]]
:   
:   Most applications store data in relational databases or XML files and 
searching over such data is a common use-case. The DataImportHandler is a Solr 
contrib that provides a configuration driven way to import this data into Solr 
in both "full builds" and using incremental delta imports.
:   
: @@ -89, +88 @@
: 
:    * '''`password`''' : The password
:    * '''`batchSize`''' : The batchsize used in jdbc connection
:    * '''`convertType`''' :(true/false)Default is 'false' Automatically reads 
the data in the target Solr data-type
: -  * '''`autoCommit`''' : If set to 'false' it sets  `setAutoCommit(false)` 
<!> [[Solr1.4]]
: +  * '''`autoCommit`''' : If set to 'false' it sets  `setAutoCommit(false)`
: -  * '''`readOnly`''' : If this is set to 'true' , it sets 
`setReadOnly(true)`, `setAutoCommit(true)`, 
`setTransactionIsolation(TRANSACTION_READ_UNCOMMITTED)`,`setHoldability(CLOSE_CURSORS_AT_COMMIT)`
 on the connection <!> [[Solr1.4]]
: +  * '''`readOnly`''' : If this is set to 'true' , it sets 
`setReadOnly(true)`, `setAutoCommit(true)`, 
`setTransactionIsolation(TRANSACTION_READ_UNCOMMITTED)`,`setHoldability(CLOSE_CURSORS_AT_COMMIT)`
 on the connection
: -  * '''`transactionIsolation`''' : The possible values are 
[TRANSACTION_READ_UNCOMMITTED, TRANSACTION_READ_COMMITTED, 
TRANSACTION_REPEATABLE_READ,TRANSACTION_SERIALIZABLE,TRANSACTION_NONE] <!> 
[[Solr1.4]]
: +  * '''`transactionIsolation`''' : The possible values are 
[TRANSACTION_READ_UNCOMMITTED, TRANSACTION_READ_COMMITTED, 
TRANSACTION_REPEATABLE_READ,TRANSACTION_SERIALIZABLE,TRANSACTION_NONE]
:   
:   
:   Any extra attributes put into the tag are directly passed on to the jdbc 
driver.
: @@ -99, +98 @@
: 
:   == Configuration in data-config.xml ==
:   A Solr document can be considered as a de-normalized schema having fields 
whose values come from multiple tables.
:   
: - The data-config.xml starts by defining a `document` element. A `document` 
represents one kind of document.  A document contains one or more root 
entities. A root entity can contain multiple sub-entities which in turn can  
contain other entities. An entity is a table/view in a relational database. 
Each entity can contain multiple fields. Each field corresponds to a column in 
the resultset returned by the ''query'' in the entity. For each field, mention 
the column name in the resultset. If the column name is different from the solr 
field name, then another attribute ''name'' should be given. Rest of the 
required attributes such as ''type'' will be inferred directly from the Solr 
schema.xml. (Can be overridden)
: + The data-config.xml starts by defining a `document` element. A `document` 
represents one kind of document.  A document contains one or more root 
entities. A root entity can contain multiple sub-entities which in turn can 
contain other entities. An entity is a table/view in a relational database. 
Each entity can contain multiple fields. Each field corresponds to a column in 
the result set returned by the ''query'' in the entity. For each field, mention 
the column name in the result set. If the column name is different from the 
solr field name, then another attribute ''name'' should be given. Rest of the 
required attributes such as ''type'' will be inferred directly from the Solr 
schema.xml. (Can be overridden)
:   
:   In order to get data from the database, our design philosophy revolves 
around 'templatized sql' entered by the user for each entity. This gives the 
user the entire power of SQL if he needs it. The root entity is the central 
table whose columns can be used to join this table with other child entities.
:   
: @@ -113, +112 @@
: 
:    * '''`threads`''' :  The no:of of threads to use to run this entity. This 
must be placed on or above a 'rootEntity'. [[Solr1.5]] 
:    * '''`pk`''' : The primary key for the entity. It is '''optional''' and 
only needed when using delta-imports. It has no relation to the uniqueKey 
defined in schema.xml but they both can be the same. 
:    * '''`rootEntity`''' : By default the entities falling under the document 
are root entities. If it is set to false , the entity directly falling under 
that entity will be treated as the root entity (so on and so forth). For every 
row returned by the root entity a document is created in Solr
: -  * '''`onError`''' : (abort|skip|continue) . The default value is 'abort' . 
'skip' skips the current document. 'continue' continues as if the error did not 
happen . <!> [[Solr1.4]]
: +  * '''`onError`''' : (abort|skip|continue) . The default value is 'abort' . 
'skip' skips the current document. 'continue' continues as if the error did not 
happen.
: -  * '''`preImportDeleteQuery`''' : before full-import this will be used to 
cleanup the index instead of using '*:*' .This is honored only on an entity 
that is an immediate sub-child of <document> <!> [[Solr1.4]].
: +  * '''`preImportDeleteQuery`''' : before full-import this will be used to 
cleanup the index instead of using '*:*' .This is honored only on an entity 
that is an immediate sub-child of <document>.
: -  * '''`postImportDeleteQuery`''' : after full-import this will be used to 
cleanup the index <!>. This is honored only on an entity that is an immediate 
sub-child of <document> [[Solr1.4]].
: +  * '''`postImportDeleteQuery`''' : after full-import this will be used to 
cleanup the index <!>. This is honoured only on an entity that is an immediate 
sub-child of <document>.
:   For !SqlEntityProcessor the entity attributes are :
:   
:    * '''`query`''' (required) : The sql string using which to query the db
:    * '''`deltaQuery`''' : Only used in delta-import
:    * '''`parentDeltaQuery`''' : Only used in delta-import
:    * '''`deletedPkQuery`''' : Only used in delta-import
: -  * '''`deltaImportQuery`''' : (Only used in delta-import) . If this is not 
present , DIH tries to construct the import query by(after identifying the 
delta) modifying the '`query`' (this is error prone). There is a namespace 
`${dataimporter.delta.<column-name>}` which can be used in this query.  e.g: 
`select * from tbl where id=${dataimporter.delta.id}`  <!> [[Solr1.4]].
: +  * '''`deltaImportQuery`''' : (Only used in delta-import) . If this is not 
present , DIH tries to construct the import query by(after identifying the 
delta) modifying the '`query`' (this is error prone). There is a namespace 
`${dataimporter.delta.<column-name>}` which can be used in this query.  e.g: 
`select * from tbl where id=${dataimporter.delta.id}`.
:   
:   
:   == Commands ==
: @@ -225, +224 @@
: 
:   </dataConfig>
:   }}}
:   
: + 
:   == Using delta-import command ==
:   Delta Import operation can be started by hitting the URL 
[[http://localhost:8983/solr/dataimport?command=delta-import]]. This operation 
will be started in a new thread and the ''status'' attribute in the response 
should be shown ''busy'' now. Depending on the size of your data set, this 
operation may take some time. At any time, you can hit 
[[http://localhost:8983/solr/dataimport]] to see the status flag.
:   
:   When delta-import command is executed, it reads the start time stored in 
''conf/dataimport.properties''. It uses that timestamp to run delta queries and 
after completion, updates the timestamp in ''conf/dataimport.properties''.
: + 
:   
:   === Delta-Import Example ===
:   We will use the same example database used in the full import example. Note 
that the database schema has been updated and each table contains an additional 
column ''last_modified'' of timestamp type. You may want to download the 
database again since it has been updated recently. We use this timestamp field 
to determine what rows in each table have changed since the last indexed time.
: @@ -311, +312 @@
: 
:    * For each row given by ''deltaQuery'', the parentDeltaQuery is executed.
:    * If any row in the root/child entity changes, we regenerate the complete 
Solr document which contained that row.
:   
: - /!\ Note :  The 'deltaImportQuery' is a Solr 1.4 feature. Originally it was 
generated automatically using the 'query' attribute which is error prone.
:   /!\ Note : It is possible to do delta-import using a full-import command . 
[[http://wiki.apache.org/solr/DataImportHandlerFaq#fullimportdelta|See here]]
:   
:   = Usage with XML/HTTP Datasource =
: @@ -319, +319 @@
: 
:   
:   <<Anchor(httpds)>>
:   
: + 
: - == Configuration of URLDataSource or HttpDataSource ==
: + == Configuration of URLDataSource ==
:   
: - <!> !HttpDataSource is being deprecated in favour of URLDataSource in 
[[Solr1.4]]
: - 
: - Sample configurations for URLDataSource <!> [[Solr1.4]] and !HttpDataSource 
in data config xml look like this
: + Sample configurations for URLDataSource in data config xml look like this
:   {{{
: - <dataSource name="b" type="!HttpDataSource" baseUrl="http://host:port/"; 
encoding="UTF-8" connectionTimeout="5000" readTimeout="10000"/>
: - <!-- or in Solr 1.4-->
:   <dataSource name="a" type="URLDataSource" baseUrl="http://host:port/"; 
encoding="UTF-8" connectionTimeout="5000" readTimeout="10000"/>
:   }}}
:   ''' The extra attributes specific to this datasource are '''
: @@ -336, +333 @@
: 
:    * '''`connectionTimeout`''' (optional):The default value is 5000ms
:    * '''`readTimeout`''' (optional): the default value is 10000ms
:   
: + 
:   == Configuration in data-config.xml ==
: - 
:   The entity for an xml/http data source can have the following attributes 
over and above the default attributes
:    * '''`processor`''' (required) : The value must be `"XPathEntityProcessor"`
:    * '''`url`''' (required) : The url used to invoke the REST API. (Can be 
templatized). if the data souce is file this must be the file location
: @@ -345, +342 @@
: 
:    * '''`forEach`'''(required) : The xpath expression which demarcates a 
record. If there are multiple types of record separate them with ''" |  "'' 
(pipe) . If  useSolrAddSchema is set to 'true' this can be omitted.
:    * '''`xsl`'''(optional): This will be used as a preprocessor for applying 
the XSL transformation. Provide the full path in the filesystem or a url.
:    * '''`useSolrAddSchema`'''(optional): Set it's value to 'true' if the xml 
that is fed into this processor has the same schema as that of the solr add 
xml. No need to mention any fields if it is set to true.
: -  * '''`flatten`''' (optional) : If this is set to true, text from under all 
the tags are extracted into one field , irrespective of the tag name. <!> 
[[Solr1.4]]
: +  * '''`flatten`''' (optional) : If this is set to true, text from under all 
the tags are extracted into one field , irrespective of the tag name.
: - 
:   
:   The entity fields can have the following attributes (over and above the 
default attributes):
:    * '''`xpath`''' (optional) : The xpath expression of the field to be 
mapped as a column in the record . It can be omitted if the column does not 
come from an xml attribute (is a synthetic field created by a transformer). If 
a field is marked as multivalued in the schema and in a given row of the xpath 
finds multiple values it is handled automatically by the XPathEntityProcessor. 
No extra configuration is required
: @@ -366, +362 @@
: 
:   }}}
:   
:   
: - == HttpDataSource Example ==
: + == URLDataSource Example ==
: - <!> !HttpDataSource is being deprecated in favour of URLDataSource in 
[[Solr1.4]]
: - 
:   Download the full import example given in the DB section to try this out. 
We'll try indexing the [[http://rss.slashdot.org/Slashdot/slashdot|Slashdot RSS 
feed]] for this example.
:   
: - 
:   The data-config for this example looks like this:
:   {{{
:   <dataConfig>
: -         <dataSource type="HttpDataSource" />
: +         <dataSource type="URLDataSource" />
:       <document>
:               <entity name="slashdot"
:                       pk="link"
: @@ -457, +450 @@
: 
:   <copyField source="title" dest="titleText"/>
:   }}}
:   
: - Time taken was around 2 hours 40 minutes to index 7278241 articles with 
peak memory usage at around 4GB. Note that many wikipedia articles are merely 
redirects to other articles, the use of $skipDoc <!> [[Solr1.4]] allows those 
articles to be ignored. Also, the column '''$skipDoc''' is only defined when 
the regexp matches.
: + Time taken was around 2 hours 40 minutes to index 7278241 articles with 
peak memory usage at around 4GB. Note that many wikipedia articles are merely 
redirects to other articles, the use of $skipDoc allows those articles to be 
ignored. Also, the column '''$skipDoc''' is only defined when the regexp 
matches.
: + 
:   
:   == Using delta-import command ==
:   The only !EntityProcessor which supports delta is !SqlEntityProcessor! The 
XPathEntityProcessor has not implemented it yet. So, unfortunately, there is no 
delta support for XML at this time.
:   If you want to implement those methods in XPathEntityProcessor: The methods 
are explained in !EntityProcessor.java.
:   
: + 
:   = Indexing Emails =
:   See MailEntityProcessor
:   
: + 
:   = Tika Integration =
:   [[TikaEntityProcessor]]
:   
: + 
:   = Extending the tool with APIs =
:   The examples we explored are admittedly, trivial . It is not possible to 
have all user needs met by an xml configuration alone. So we expose a few 
abstract class which can be implemented by the user to enhance the 
functionality.
:   
:   <<Anchor(transformer)>>
: + 
: + 
:   == Transformer ==
:   Every set of fields fetched by the entity can be either consumed directly 
by the indexing process or they can be massaged using transformers to modify a 
field or create a totally new set of fields, it can even return more than one 
row of data. The transformers must be configured on an entity level as follows.
:   {{{
: @@ -488, +487 @@
: 
:   existing field will be unaltered and an undefined field will remain 
undefined. The chaining effect described above allows a column's value to be 
altered again and again by successive transformers. A transformer may make use 
of other entity fields in the course of massaging a columns value.
:   
:   
: - 
:   === RegexTransformer ===
: - 
:   There is an built-in transformer called '!RegexTransfromer' provided with 
DIH. It helps in extracting or manipulating values from fields (from the 
source) using Regular Expressions. The actual class name is 
`org.apache.solr.handler.dataimport.RegexTransformer`. But as it belongs to the 
default package the package-name can be omitted.
: - 
: - 
:   
:   '''Attributes'''
:   
: @@ -501, +496 @@
: 
:    * '''`regex`''' : The regular expression that is used to match against the 
column or sourceColName's value(s). If `replaceWith` is absent, each regex 
''group'' is taken as a value and a list of values is returned
:    * '''`sourceColName`''' : The column on which the regex is to be applied. 
If this is absent source and target are same
:    * '''`splitBy`''' : Used to split a String to obtain multiple values, 
returns a list of values
: -  * '''`groupNames`''' : A comma separated list of field column names, used 
where the `regex` contains groups and each group is to be saved to a different 
field. If some groups are not to be named leave a space between commas.  <!> 
[[Solr1.4]]
: +  * '''`groupNames`''' : A comma separated list of field column names, used 
where the `regex` contains groups and each group is to be saved to a different 
field. If some groups are not to be named leave a space between commas.
:    * '''`replaceWith`''' : Used along with `regex` . It is equivalent to the 
method `new String(<sourceColVal>).replaceAll(<regex>, <replaceWith>)`
:   
:   example:
: @@ -628, +623 @@
: 
:   
:    * '''`template`''' : The template string. In the above example there are 
two placeholders '${e.name}' and '${eparent.surname}' .   Both the values must 
be present when it is being evaluated.
:   
: + 
:   === HTMLStripTransformer ===
: - <!> [[Solr1.4]]
:   
:   Can be used to strip HTML out of a string field
:   e.g.:
: @@ -646, +641 @@
: 
:    * '''`stripHTML`''' : Boolean value to signal if HTMLStripTransformer 
should process this field or not.
:   
:   === ClobTransformer ===
: - <!> [[Solr1.4]]
:   
:   Can be used to create a String out of a Clob type in database.
:   e.g.:
: @@ -663, +657 @@
: 
:    * '''`sourceColName`''' : The source column to be used as input. If this 
is absent source and target are same
:   
:   === LogTransformer ===
: - <!> [[Solr1.4]]
:   
:   Can be used to Log data to console/logs.
:   e.g.:
: @@ -680, +673 @@
: 
:   <<Anchor(example-transformers)>>
:   === Transformers Example ===
:   
: - <!> [[Solr1.4]] The following example shows transformer chaining in action 
along with extensive reuse of variables. An invariant is defined in the 
solrconfig.xml and reused within some transforms. Column names from both 
entities are also used in transforms.
: + The following example shows transformer chaining in action along with 
extensive reuse of variables. An invariant is defined in the solrconfig.xml and 
reused within some transforms. Column names from both entities are also used in 
transforms.
:   
:   Imagine we have XML documents, each of which describes a set of images. The 
images are stored in an images subdirectory of the XML document. An attribute 
storing an images filename is accompanied by a brief caption and a relative 
link to another document holding a longer description of the image. Finally the 
image name if preceded by an 's' links to a smaller icon sized version of the 
image which is always a png. We want SOLR to store fields containing the 
absolute link to the image, its icon and the full description. The following 
shows one way we could configure solrconfig.xml and DIH's data-config.xml to 
index this data.
:   
: @@ -751, +744 @@
: 
:   Each entity is handled by a default Entity processor called 
!SqlEntityProcessor. This works well for systems which use RDBMS as a 
datasource. For other kind of datasources like  REST or Non Sql datasources you 
can choose to extend this abstract class 
`org.apache.solr.handler.dataimport.Entityprocessor`. This is designed to 
Stream rows one by one from an entity. The simplest way to implement your own 
!EntityProcessor is to extend !EntityProcessorBase and override the `public 
Map<String,Object> nextRow()` method.
:   '!EntityProcessor' rely on the !DataSource for fetching data. The return 
type of the !DataSource is important for an !EntityProcessor. The built-in ones 
are,
:   
: + 
:   === SqlEntityProcessor ===
:   This is the defaut. The !DataSource must be of type 
`DataSource<Iterator<Map<String, Object>>>` . !JdbcDataSource can be used with 
this.
:   
: + 
:   === XPathEntityProcessor ===
: - Used when indexing XML type data. The !DataSource must be of type 
`DataSource<Reader>` . URLDataSource <!> [[Solr1.4]] or !FileDataSource is 
commonly used with XPathEntityProcessor.
: + Used when indexing XML type data. The !DataSource must be of type 
`DataSource<Reader>` . URLDataSource or !FileDataSource is commonly used with 
XPathEntityProcessor.
: + 
:   
:   === FileListEntityProcessor ===
: - A simple entity processor which can be used to enumerate the list of files 
from a File System based on some criteria. It does not use a !DataSource. The 
entity attributes are:
: + A simple entity processor which can be used to enumerate the list of files 
from a File System based on some criteria. The entity attributes are:
:    * '''`fileName`''' :(required) A regex pattern to identify files
:    * '''`baseDir`''' : (required) The Base directory (absolute path)
:    * '''`recursive`''' : Recursive listing or not. Default is 'false'
: @@ -766, +762 @@
: 
:    * '''`newerThan`''' : A date param . Use the format (`yyyy-MM-dd 
HH:mm:ss`) . It can also be a datemath string eg: ('NOW-3DAYS'). The single 
quote is necessary . Or it can be a valid variableresolver format like 
(${var.name})
:    * '''`olderThan`''' : A date param . Same rules as above
:    * '''`rootEntity`''' :It must be false for this (Unless you wish to just 
index filenames) An entity directly under the <document> is a root entity. That 
means that for each row emitted by the root entity one document is created in 
Solr/Lucene. But as in this case we do not wish to make one document per file. 
We wish to make one document per row emitted by the following entity 'x'. 
Because the entity 'f' has rootEntity=false the entity directly under it 
becomes a root entity automatically and each row emitted by that becomes a 
document.
: -  * '''`dataSource`''' :If you use Solr1.3 It must be set to "null" because 
this does not use any !DataSource. No need to specify that in Solr1.4 .It just 
means that we won't create a !DataSource instance. (In most of the cases there 
is only one !DataSource (A !JdbcDataSource) and all entities just use them. In 
case of !FileListEntityProcessor a !DataSource is not necessary.)
: +  * '''`dataSource`''' :If present it must be set to "null". In most of the 
other !EntityProcessor's there is only one !DataSource (A !JdbcDataSource) and 
they just use them by default. In case of !FileListEntityProcessor a 
!DataSource is not necessary.)
:   
:   example:
:   {{{
: @@ -812, +808 @@
: 
:   
:   === PlainTextEntityProcessor ===
:   <<Anchor(plaintext)>>
: - <!> [[Solr1.4]]
:   
:   This !EntityProcessor reads all content from the data source into an single 
implicit field called 'plainText'. The content is not parsed in any way, 
however you may add transformers to manipulate the data within 'plainText' as 
needed or to create other additional fields.
:   
: @@ -828, +823 @@
: 
:   
:   === LineEntityProcessor ===
:   <<Anchor(LineEntityProcessor)>>
: - <!> [[Solr1.4]]
:   
:   This !EntityProcessor reads all content from the data source on a line by 
line basis, a field called 'rawLine' is returned for each line read. The 
content is not parsed in any way, however you may add transformers to 
manipulate the data within 'rawLine' or to create other additional fields.
:   
: @@ -863, +857 @@
: 
:   }}}
:   and it can be used in the entities like a standard one
:   
: + 
:   === JdbcdataSource ===
:   This is the default. See the  [[#jdbcdatasource|example]] . The signature 
is as follows
:   {{{
: @@ -870, +865 @@
: 
:   }}}
:   It is designed to iterate rows in DB one by one. A row is represented as a 
Map.
:   
: + 
:   === URLDataSource ===
: - <!> [[Solr1.4]]
:   This datasource is often used with XPathEntityProcessor to fetch content 
from an underlying file:// or http:// location. See the documentation 
[[#httpds|here]] . The signature is as follows
:   {{{
:   public class URLDataSource extends DataSource<Reader>
:   }}}
:   
: - === HttpDataSource ===
: - <!> !HttpDataSource is being deprecated in favour of URLDataSource in 
[[Solr1.4]]. There is no change in functionality between URLDataSource and 
!HttpDataSource, only a name change.
: + <!> !HttpDataSource has been deprecated in favour of URLDataSource as of 
version [[Solr1.4]]. There is no change in functionality between URLDataSource 
and !HttpDataSource, only a name change.
: + 
:   
:   === FileDataSource ===
:   This can be used like an URLDataSource but used to fetch content from files 
on disk. The only difference from URLDataSource, when accessing disk files, is 
how a pathname is specified. The signature is as follows
: @@ -890, +885 @@
: 
:    * '''`basePath`''': (optional) The base path relative to which the value 
is evaluated if it is not absolute
:    * '''`encoding`''': (optional) If the files are to be read in an encoding 
that is not same as the platform encoding
:   
: + 
:   === FieldReaderDataSource ===
: - <!> [[Solr1.4]]
: - 
:   This can be used like an URLDataSource . The signature is as follows
:   {{{
:   public class FieldReaderDataSource extends DataSource<Reader>
: @@ -909, +903 @@
: 
:   }}}
:   
:   === ContentStreamDataSource ===
: - <!> [[Solr1.4]]
: - 
:   Use this to use the POST data as the !DataSource. This can be used with any 
!EntityProcessor that uses a !DataSource<Reader>
: + 
:   
:   == EventListeners ==
:   !EventListener can be registered for "onImportStart" and onImportEnd" .It 
must implement the interface  
[[http://svn.apache.org/viewvc/lucene/dev/trunk/solr/contrib/dataimporthandler/src/main/java/org/apache/solr/handler/dataimport/EventListener.java?view=markup|EventListener]].
: @@ -924, +917 @@
: 
:   </dataConfig>
:   }}}
:   
: + 
:   == Special Commands ==
:   Special commands can be given to DIH by adding certain variables to the row 
returned by any of the components .
:    * '''`$skipDoc`''' : Skip the current document . Do not add it to Solr. 
The value can be String true/false
:    * '''`$skipRow`''' : Skip the current row. The document will be added with 
rows from other entities. The value can be String true/false
:    * '''`$docBoost`''' : Boost the current doc. The value can be a number or 
the toString of a number
: -  * '''`$deleteDocById`''' : Delete a doc from Solr with this id. The value 
hast to be the unniqueKey value of the document <!> [[Solr1.4]]
: +  * '''`$deleteDocById`''' : Delete a doc from Solr with this id. The value 
hast to be the unniqueKey value of the document
: -  * '''`$deleteDocByQuery`''' :Delete docs from Solr by this query. The 
value must be a Solr Query <!> [[Solr1.4]]
: +  * '''`$deleteDocByQuery`''' :Delete docs from Solr by this query. The 
value must be a Solr Query
:   
:   
:   == Adding datasource in solrconfig.xml ==
: @@ -951, +945 @@
: 
:     </requestHandler>
:   }}}
:   <<Anchor(arch)>>
: + 
: + 
:   = Architecture =
:   The following diagram describes the logical flow for a sample configuration.
:   
: @@ -960, +956 @@
: 
:   There are 3 datasources two RDBMS (jdbc1,jdbc2) and one xml/http (B)
:   
:    * `jdbc1` and `jdbc2` are instances of  type `JdbcDataSource` which are 
configured in the solrconfig.xml.
: -  * `http` is an instance of type `HttpDataSource`
: +  * `http` is an instance of type `URLDataSource`
:    * The root entity starts with a table called 'A' and uses 'jdbc1' as the 
datasource . The entity is conveniently named as the table itself
:    * Entity 'A' has 2 sub-entities 'B' and 'C' . 'B' uses the datasource 
instance  'http' and 'C' uses the datasource instance 'jdbc2'
:    * On doing a `command=full-import` The root-entity (A) is executed first
: @@ -980, +976 @@
: 
:   == What is a row? ==
:   A row in !DataImportHandler is a Map (Map<String, Object>). In the map , 
the key is the name of the field and the value can be anything which is a valid 
Solr type. The value can also be a Collection of the valid Solr types (this may 
get mapped to a multi-valued field). If the !DataSource is RDBMS a query cannot 
emit a multivalued field. But it is possible to create a multivalued field by 
joining an entity with another.i.e if the sub-entity returns multiple rows for 
one row from parent entity it can go into a multivalued field. If the 
datasource is xml, it is possible to return a multivalued field.
:   
: + 
:   == A VariableResolver ==
:   A !VariableResolver is the component which replaces all those placeholders 
such as `${<name>}`. It is a multilevel Map.  Each namespace is a Map and 
namespaces are separated by periods (.) . eg if there is a placeholder 
${item.ID} , 'item' is a nampespace (which is a map) and 'ID' is a value in 
that namespace. It is possible to nest namespaces like ${item.x.ID} where x 
could be another Map. A reference to the current !VariableResolver can be 
obtained from the Context. Or the object can be directly consumed by using 
${<name>} in 'query' for RDMS queries or 'url' in Http .
:   === Custom formatting in query and url using Functions ===
:   While the namespace concept is useful , the user may want to put some 
computed value into the query or url for example there is a Date object and 
your datasource  accepts Date in some custom format . There are a few functions 
provided by the !DataImportHandler which can do some of these.
: -  * ''formatDate'' : It is used like this 
`'${dataimporter.functions.formatDate(item.ID, 'yyyy-MM-dd HH:mm')}'` . The 
first argument can be a valid value from the !VariableResolver and the second 
cvalue can be a a format string (use !SimpledateFormat) . The first argument 
can be a computed value eg: `'${dataimporter.functions.formatDate('NOW-3DAYS', 
'yyyy-MM-dd HH:mm')}'` and it uses the syntax of the datemath parser in Solr. 
(note that it must enclosed in single quotes) . <!> Note . This syntax has been 
changed in 1.4 . The second parameter was not enclosed in single quotes 
earlier. But it will continue to work without single quote also.
: +  * ''formatDate'' : It is used like this 
`'${dataimporter.functions.formatDate(item.ID, 'yyyy-MM-dd HH:mm')}'` . The 
first argument can be a valid value from the !VariableResolver and the second 
value can be a format string (use !SimpledateFormat) . The first argument can 
be a computed value eg: `'${dataimporter.functions.formatDate('NOW-3DAYS', 
'yyyy-MM-dd HH:mm')}'` and it uses the syntax of the datemath parser in Solr.
:    * ''escapeSql'' : Use this to escape special sql characters . eg : 
`'${dataimporter.functions.escapeSql(item.ID)}'`. Takes only one argument and 
must be a valid value in the !VaraiableResolver.
:    * ''encodeUrl'' : Us this to encode urls . eg : 
`'${dataimporter.functions.encodeUrl(item.ID)}'` . Takes only one argument and 
must be a valid value in the !VariableResolver
: + 
:   
:   ==== Custom Functions ====
:   It is possible to plug in custom functions into DIH. Implement an 
[[http://svn.apache.org/viewvc/lucene/dev/trunk/solr/contrib/dataimporthandler/src/main/java/org/apache/solr/handler/dataimport/Evaluator.java?view=markup|Evalutor]]
 and specify it in the data-config.xml . Following is an example of an 
evaluator which does a 'toLowerCase' on a String.
: 



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Re: [Solr Wiki] Update of "DataImportHandler" by FergusMcMenemie

Reply via email to