[
https://issues.apache.org/jira/browse/SOLR-7061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297082#comment-14297082
]
Mark Peng edited comment on SOLR-7061 at 1/30/15 2:32 AM:
----------------------------------------------------------
[~noble.paul] Yes, the abstract class Context provides
setSessionAttribute(name, val, scope) to store row data at different scopes. We
tried to use it, but found some issues:
We want to cache as little data as possible for each function. ContextImpl
uses a HashMap<String, Object> to store document-level data in DocWrapper for
all entities, whether or not any function uses them, so it may keep
unnecessary values in memory.
Only the variables used as function arguments of the current ScriptTransformer
need to be cached. The caches are cleared right after the end of each document.
We chose a simpler design that keeps a dedicated ordered map for each
ScriptTransformer, so each one maintains its own mapping of resolved variables
and function arguments using few resources. An isResolved flag avoids
resolving the same variable multiple times when the current entity has
multiple rows. The scope of the change is also kept small (it only
affects ScriptTransformer and EntityProcessorWrapper).
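
To illustrate the design described above, here is a minimal Java sketch of a per-transformer ordered cache with an isResolved flag. It is not the code from SOLR-7061.patch: the class name ResolvedArgCache, its methods, and the Function-based resolver are all hypothetical stand-ins (in DIH the lookup would presumably go through the variable resolver).

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical per-transformer cache; an illustration, not the patch's class. */
final class ResolvedArgCache {
    // LinkedHashMap iterates in insertion order, so the cached values stay
    // aligned with the order in which the function arguments were declared.
    private final Map<String, Object> resolved = new LinkedHashMap<>();
    private boolean isResolved = false;

    ResolvedArgCache(List<String> argumentExpressions) {
        // Only the expressions used as arguments of this transformer are tracked.
        for (String expr : argumentExpressions) {
            resolved.put(expr, null);
        }
    }

    /** Resolves each argument once; later rows of the same entity reuse the values. */
    Map<String, Object> resolveOnce(Function<String, Object> resolver) {
        if (!isResolved) {
            for (Map.Entry<String, Object> e : resolved.entrySet()) {
                e.setValue(resolver.apply(e.getKey()));
            }
            isResolved = true;
        }
        return Collections.unmodifiableMap(resolved);
    }

    /** Intended to be called at the end of each document to drop cached values. */
    void clear() {
        for (Map.Entry<String, Object> e : resolved.entrySet()) {
            e.setValue(null);
        }
        isResolved = false;
    }

    public static void main(String[] args) {
        // Stand-in for the parent entity's row (assumed values).
        Map<String, Object> parentRow = new LinkedHashMap<>();
        parentRow.put("${PRODUCT.IS_ONSALE}", Boolean.TRUE);
        parentRow.put("${PRODUCT.DISCOUNT_RATE}", 0.8);

        int[] lookups = {0};
        ResolvedArgCache cache = new ResolvedArgCache(
            Arrays.asList("${PRODUCT.IS_ONSALE}", "${PRODUCT.DISCOUNT_RATE}"));

        // Three rows of the child entity: the two arguments are resolved only once.
        for (int row = 0; row < 3; row++) {
            cache.resolveOnce(expr -> { lookups[0]++; return parentRow.get(expr); });
        }
        System.out.println(cache.resolveOnce(parentRow::get).keySet());
        // prints [${PRODUCT.IS_ONSALE}, ${PRODUCT.DISCOUNT_RATE}]
        System.out.println("lookups=" + lookups[0]); // prints lookups=2
        cache.clear(); // end of document
    }
}
```

The sketch keeps one small map per transformer rather than a shared document-level map, which is the trade-off argued for above: less cached data, and argument order that survives resolution.
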
was (Author: markpeng):
[~noble.paul] Yes, the abstract class Context provides
setSessionAttribute(name, val, scope) to store row data at different scopes. We
tried to use it, but found some issues:
1. ContextImpl uses *HashMap<String, Object>* to store document-level data in
DocWrapper, but the order of keys is not preserved. Since we are passing
function arguments, we need to keep their order so that each argument maps to
the value of the correct resolved variable.
In our design we use a dedicated *LinkedHashMap<String, Object>* for each
ScriptTransformer. Note that only the variables used as function arguments of
the current ScriptTransformer are cached in the LinkedHashMap. Caches are
removed right after the last row of the entity for each document.
2. Transformer only defines a *transformRow(row, context)* function for each
entity itself, which prevents the ScriptTransformer from accessing
function arguments passed from other entities at that time. Even though we can
put function arguments into session attributes through the context, *since the
document-level session map is a mix of variables from all entities, we have no
idea which of them are selected for the current script function* (the order of
values is lost after parsing arguments into a HashMap using *replaceToken()*
of VariableResolver).
So we chose a simpler design that keeps *a dedicated ordered LinkedHashMap for
each ScriptTransformer*, so each one maintains its own mapping of resolved
variables and function arguments using few resources. An isResolved flag
avoids resolving the same variable multiple times when the current entity has
multiple rows. The scope of the change is also kept small
(it only affects ScriptTransformer and EntityProcessorWrapper).
> Cross-Entity Variable Resolving and Arguments for ScriptTransformer Functions
> -----------------------------------------------------------------------------
>
> Key: SOLR-7061
> URL: https://issues.apache.org/jira/browse/SOLR-7061
> Project: Solr
> Issue Type: Improvement
> Components: contrib - DataImportHandler
> Affects Versions: 4.10.3
> Reporter: Mark Peng
> Priority: Minor
> Labels: dataimport, transformers
> Attachments: SOLR-7061.patch
>
>
> Script Transformer is widely used to modify column values of rows selected
> from the target data source (such as a SQL database) based on custom logic,
> before they are written to Solr as documents. However, the current
> implementation has the following limitations:
> *1. It is not possible to pass constant values or resolved variables (e.g.,
> $\{TABLE.COLUMN\} ) as arguments to a script function.*
> *2. Cross-entity row data exchange is not possible either.*
> In our use case, we have complex nested entities and rely heavily on
> script functions to transform table rows during data import. Sometimes,
> for a single document, we need to bring selected column values from a
> parent entity into the current entity to transform values and apply
> if-else logic. To achieve this today, we have to join with other tables in the
> SQL of the current entity, which is quite resource-consuming, especially for
> large tables.
> Therefore, we have made some improvements that allow us to pass selected column
> values from entity A to another entity B as its function arguments by
> using the variable resolver.
> Here is an example of how it works. Suppose we have the following
> configuration:
> {code}
> <dataConfig>
>   <dataSource name="ProductDB"
>               driver="oracle.jdbc.driver.OracleDriver"
>               url="jdbc:oracle:thin:@${dataimporter.request.host}:${dataimporter.request.port}/${dataimporter.request.name}"
>               user="${dataimporter.request.user}"
>               password="${dataimporter.request.password}"
>               autoCommit="true"/>
>   <!-- ScriptTransformer functions -->
>   <script><![CDATA[
>     function processItemRow(row, resolvedVars) {
>       var isOnSale = resolvedVars.get("${PRODUCT.IS_ONSALE}");
>       var discount = resolvedVars.get("${PRODUCT.DISCOUNT_RATE}");
>       var price = row.get("PRICE");
>
>       if (isOnSale) {
>         row.put("PRICE", price * discount);
>       } else {
>         row.put("PRICE", price);
>       }
>
>       return row;
>     }
>   ]]>
>   </script>
>   <document name="EC_SHOP">
>     <entity dataSource="ProductDB" name="PRODUCT"
>             query="SELECT PRODUCT_ID, TITLE, IS_ONSALE, DISCOUNT_RATE
>                    FROM PRODUCT">
>       <field column="PRODUCT_ID" name="PRODUCT_ID"/>
>       <field column="TITLE" name="TITLE"/>
>       <field column="IS_ONSALE" name="IS_ONSALE"/>
>       <field column="DISCOUNT_RATE" name="DISCOUNT_RATE"/>
>
>       <entity dataSource="ProductDB" name="ITEM"
>               transformer="script:processItemRow(${PRODUCT.IS_ONSALE},${PRODUCT.DISCOUNT_RATE})"
>               query="SELECT PRICE FROM ITEM WHERE PRODUCT_ID = '${PRODUCT.PRODUCT_ID}'">
>         <field column="PRICE" name="PRICE"/>
>       </entity>
>     </entity>
>   </document>
> </dataConfig>
> {code}
> As demonstrated above, we can now access the values of columns
> *IS_ONSALE* and *DISCOUNT_RATE* of table *PRODUCT* from the entity of table
> *ITEM* by passing *$\{PRODUCT.IS_ONSALE\}* and *$\{PRODUCT.DISCOUNT_RATE\}*
> as arguments of the function *processItemRow*, to determine whether a
> discount should be applied to the product price. The function signature has a
> second argument (named *resolvedVars* here) that receives the map of column
> values resolved from previous entities.
> This improvement gives script functions more flexibility to exchange row
> data across entities (even across data sources) and to do more complex
> processing of entity rows.