Hi Michael,

Having faced the same limitation, I have found these two libraries to be helpful:
- Frameless (https://github.com/typelevel/frameless)
- struct-type-encoder (https://benfradet.github.io/blog/2017/06/14/Deriving-Spark-Dataframe-schemas-with-Shapeless)

Both use Shapeless to derive Datasets. I hope it helps.

Patrick.

> On Nov 14, 2017, at 20:38, mlopez <michael.lopez....@gmail.com> wrote:
>
> Hello everyone!
>
> I'm a developer at a security ratings company. We've been moving to Spark
> for our data analytics, and nearly every dataset we have contains IP
> addresses or variable-length subnets. Katherine's descriptions of use cases
> and attempts to emulate networking types overlap with ours. I would add that
> we also need to write complex queries over subnets in addition to IP
> addresses.
>
> Has there been any update on this topic?
> https://github.com/apache/spark/pull/16478 was last updated in February of
> this year.
>
> I would also like to know whether it would be better to work toward IP
> networking types. Supposing Spark had UDT support, would it be just as good
> as built-in support for networking types? Where would they fall short? Would
> it be possible to pass custom rules to Catalyst for optimizing expressions
> with networking types?
>
> We have to write complex joins over predicates like subnet containment, and
> we have to resort to difficult-to-read tricks to ensure that Spark doesn't
> fall back to an inefficient join strategy. For example, it would be great to
> simply write `df1.join(df2, contains($"src_net", $"dst_net"))` to join
> records from one dataset whose subnets are contained in subnets of another.
>
> -----
> Michael Lopez
> Cheerful Engineer!
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
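P.S. On the subnet-containment predicate itself: here is a minimal, IPv4-only sketch in plain Scala, independent of Spark. The `contains` name just mirrors the hypothetical function in your example; wrapping a function like this in a UDF works today, but a UDF is opaque to Catalyst, so it would not by itself fix the join-strategy problem you describe — that is exactly where native networking types (or custom Catalyst rules) would help.

```scala
// Hypothetical sketch of a subnet-containment predicate over IPv4 CIDR
// strings such as "10.0.0.0/8". Not Spark API; in Spark this could be
// wrapped in a UDF, at the cost of being invisible to the optimizer.
object SubnetContains {
  // Parse "a.b.c.d/p" into (address packed into an Int, prefix length).
  def parseCidr(cidr: String): (Int, Int) = {
    val Array(ip, p) = cidr.split("/")
    val addr = ip.split("\\.").map(_.toInt).foldLeft(0)((acc, octet) => (acc << 8) | octet)
    (addr, p.toInt)
  }

  // True iff every address in `inner` also lies in `outer`:
  // the outer prefix must be no longer than the inner one, and the
  // two networks must agree on the outer prefix's bits.
  def contains(outer: String, inner: String): Boolean = {
    val (oAddr, oLen) = parseCidr(outer)
    val (iAddr, iLen) = parseCidr(inner)
    // Guard the oLen == 0 case: in the JVM, `-1 << 32` shifts by 32 mod 32 = 0.
    val mask = if (oLen == 0) 0 else -1 << (32 - oLen)
    oLen <= iLen && (oAddr & mask) == (iAddr & mask)
  }
}
```

For example, `SubnetContains.contains("10.0.0.0/8", "10.1.0.0/16")` is true, while swapping the arguments makes it false.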